Implementing a publisher/subscriber bus for two Azure Table Storage tables

I have two Azure storage tables:
Table 1: This table is updated regularly by a background task.
Table 2: Contains a subset of the entities in Table 1, but it needs to be updated whenever there is a change in Table 1.
| Table 2 |     | Table 1 |
|---------|-----|---------|
|         |     | A       |
| B       |     | B       |
| C       | <=> | C       |
| D       |     | D       |
|         |     | E       |
|         |     | F       |
Basically, what I want to achieve is that Table 1 should always be listening to Table 2: whenever I add an entity to Table 2, Table 1 should know that I am interested in tracking that item, and both entities should be updated when an update is available.

Well, here is a suggestion:
Have the background task send a message to a Service Bus queue containing a reference to the changed entity.
Make an Azure Function that listens to the queue; it can check whether an entity matching the updated one exists in the other table.
If one does, it can update it (sketched below).
More info: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-service-bus
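To make the last two steps concrete, here is a minimal sketch of what the queue-triggered function could look like, assuming Python with the azure-functions and azure-data-tables packages; the connection-string setting, table name and message shape are illustrative assumptions, not part of the original suggestion:
import json
import os

import azure.functions as func
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient, UpdateMode


def main(msg: func.ServiceBusMessage):
    # Bound to the queue via a serviceBusTrigger binding in function.json.
    # The background task is assumed to post the changed entity (or at least
    # its keys) as JSON, e.g. {"PartitionKey": "...", "RowKey": "...", ...}
    changed = json.loads(msg.get_body().decode("utf-8"))

    table2 = TableClient.from_connection_string(
        os.environ["STORAGE_CONNECTION"], table_name="Table2"
    )

    try:
        # Only entities that were added to Table 2 are being tracked
        table2.get_entity(changed["PartitionKey"], changed["RowKey"])
    except ResourceNotFoundError:
        return  # not tracked, nothing to do

    # Merge the updated values into the tracked copy
    table2.update_entity(changed, mode=UpdateMode.MERGE)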
Support for these events might be added to Azure Event Grid in the future. When that happens, it would remove the need to post events about table changes yourself; you would only need the Event Grid subscription plus the Azure Function.

Azure data factory - large dependency pipeline

I have a "master" pipeline in Azure Data factory, which looks like this:
One rectangle is an Execute Pipeline activity for one destination (target) table, so this "child" pipeline takes some data, transforms it and saves it as a specified table. Essentially this means that before filling a table on the right, we have to fill the previous tables (those connected to it by a line).
The problem is that this master pipeline contains more than 100 activities, and the limit for a Data Factory pipeline is 40 activities.
I was thinking about dividing the pipeline into several smaller pipelines (i.e. first layer (the 3 rectangles on the left), then second layer, etc.); however, this could cause the pipeline to run a lot longer, as there could be some large table in each layer.
How to approach this? What is the best practice here?
I had a similar issue at work, but I didn't use Execute Pipeline because it was a terrible approach in my case. I have more than 800 PLs to run, with multiple parent and child dependencies that can go multiple levels deep depending on the complexity of the data, plus several restrictions (starting with transforming data for 9 regions in the US while reusing PLs). A simplified diagram of one of the many cases I have can easily look like this:
The solution:
A master dependency table where to store all the dependencies:
| Job ID | dependency ID | level | PL_name |
|--------|---------------|-------|--------------|
| Token1 | | 0 | |
| L1Job1 | Token1 | 1 | my_PL_name_1 |
| L1Job2 | Token1 | 1 | my_PL_name_2 |
| L2Job1 | L1Job1,L1Job2 | 2 | my_PL_name_3 |
| ... | ... | ... | ... |
From here it is a tree problem: there are ways of mapping trees in SQL (a sketch of the level computation follows the tracker table below). Once you have all the dependencies mapped from the tree, put them in a stage or tracker table:
| Job ID | dependency ID | level | status | start_date | end_date |
|--------|---------------|-------|-----------|------------|----------|
| Token1 | | 0 | | | |
| L1Job1 | Token1 | 1 | Running | | |
| L1Job2 | Token1 | 1 | Succeeded | | |
| L2Job1 | L1Job1,L1Job2 | 2 | | | |
| ... | ... | ... | ... | ... | ... |
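To make the "tree problem" concrete, here is the level computation sketched in Python purely for illustration (the answer does this mapping in SQL; the job names follow the tables above):
def assign_levels(parents_by_job: dict) -> dict:
    """parents_by_job maps job_id -> list of parent job_ids; returns job_id -> level."""
    levels = {}

    def level_of(job_id: str) -> int:
        if job_id not in levels:
            parents = parents_by_job.get(job_id, [])
            # A job with no parents is the batch token (level 0); otherwise it
            # sits one level below its deepest parent.
            levels[job_id] = 0 if not parents else 1 + max(level_of(p) for p in parents)
        return levels[job_id]

    for job_id in parents_by_job:
        level_of(job_id)
    return levels


# assign_levels({"Token1": [], "L1Job1": ["Token1"], "L1Job2": ["Token1"],
#                "L2Job1": ["L1Job1", "L1Job2"]})
# -> {"Token1": 0, "L1Job1": 1, "L1Job2": 1, "L2Job1": 2}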
We can easily query this table with a Lookup activity to get the level 1 PLs to run, and use a For Each activity to trigger each target PL with a dynamic Web Activity. Then update the tracker table's status, start_date, end_date, etc. accordingly per PL.
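For illustration, the call that the dynamic Web Activity makes boils down to the Data Factory createRun REST endpoint; a rough Python equivalent (the subscription, resource group, factory and pipeline names are placeholders) might look like:
import requests
from azure.identity import DefaultAzureCredential


def trigger_pipeline(subscription_id: str, resource_group: str,
                     factory: str, pipeline_name: str) -> str:
    """Start a pipeline run and return its run ID (to store in the tracker table)."""
    token = DefaultAzureCredential().get_token(
        "https://management.azure.com/.default"
    ).token
    url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
        f"/factories/{factory}/pipelines/{pipeline_name}/createRun"
        "?api-version=2018-06-01"
    )
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
    resp.raise_for_status()
    return resp.json()["runId"]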
There are only two PLs orchestrating:
one for mapping the tree and assigning some type of unique ID to that batch;
two for validation (it verifies the status of the parent PLs and controls which PL runs next).
Note: both call a stored procedure with some logic that depends on the case.
I have a recursive call to the validation PL each time a target pipeline ends:
Let's assume L1Job1 and L1Job2 are running in parallel:
L1Job1 ends successfully -> calls the validation PL -> the validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.
If L1Job2 hasn't ended, the validation PL ends without triggering L2Job1.
Then L1Job2 ends successfully -> calls the validation PL -> the validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.
L2Job1 starts running after passing the validations.
Repeat for each level.
This works because we already mapped all the PL dependencies in the job tracker, so we know exactly which PLs should run.
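Purely to illustrate the check described above, here is the validation rule sketched in Python (in the answer this logic lives in the stored procedure that the validation PL calls; the values follow the tracker table):
from typing import Dict, List


def jobs_ready_to_run(tracker: List[Dict]) -> List[Dict]:
    """Return the jobs whose parents have all succeeded and that have not started yet."""
    # The level-0 batch token counts as done so that level-1 jobs can start.
    succeeded = {row["job_id"] for row in tracker
                 if row["status"] == "Succeeded" or row["level"] == 0}
    ready = []
    for row in tracker:
        if row["level"] == 0 or row["status"]:  # skip the token and anything already started
            continue
        parents = row["dependency_id"].split(",")
        if all(parent in succeeded for parent in parents):
            ready.append(row)
    return ready
With the tracker rows above, L2Job1 only becomes ready once both L1Job1 and L1Job2 show Succeeded; while L1Job1 is still Running, nothing at level 2 is returned.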
I know this looks complicated and maybe it can't be applied to your case, but I hope it gives you or others a clue about how to solve complex data workflows in Azure Data Factory.
Yes, as per the documentation, the maximum number of activities per pipeline, which includes inner activities for containers, is 40.
So the only option left is splitting your pipeline into multiple smaller pipelines.
Please check the link below for the limits on ADF:
https://github.com/MicrosoftDocs/azure-docs/blob/master/includes/azure-data-factory-limits.md

ADF Pipeline Adding Sequential Value in Copy Activity

Apologies if this has been asked and answered elsewhere; if it has, please point me to the URL in the comments. So here is the situation:
I am making an API request; in the response I get an auth_token, which I use in the Copy Activity as the Authorization header to retrieve data in JSON format and sink it to an Azure SQL Database. I am able to map all the elements I receive in the JSON to the columns of the Azure SQL Database. However, there are two columns (UploadId and RowId) that still need to be populated.
UploadId is a GUID which will be the same for the whole batch of rows (this I've managed to solve).
RowId will be a sequence starting from 1 to the end of that batch, and then for the next batch (with a new GUID value) it resets back to 1.
The database will look something like this,
| APILoadTime | UploadId          | RowId |
|-------------|-------------------|-------|
| 2020-02-01  | 29AD7-12345-22EwQ | 1     |
| 2020-02-01  | 29AD7-12345-22EwQ | 2     |
| 2020-02-01  | 29AD7-12345-22EwQ | 3     |
| 2020-02-01  | 29AD7-12345-22EwQ | 4     |
| 2020-02-01  | 29AD7-12345-22EwQ | 5     |
--------------------------------------------------> End of Batch One / Start of Batch Two
| 2020-02-01  | 30AD7-12345-22MLK | 1     |
| 2020-02-01  | 30AD7-12345-22MLK | 2     |
| 2020-02-01  | 30AD7-12345-22MLK | 3     |
| 2020-02-01  | 30AD7-12345-22MLK | 4     |
| 2020-02-01  | 30AD7-12345-22MLK | 5     |
--------------------------------------------------> End of Batch Two, and so on ...
Is there a way to achieve this RowId behaviour in Azure Data Factory's Copy Activity, or, if not, is it possible within Azure SQL Database?
Apologies for the long description, and thank you in advance for any help!
Regards
You need to use a window function to achieve this. ADF Data Flows have a Window transformation that is designed to do this exact thing.
Otherwise, you could load the data into a staging table and then use Azure SQL to window the data as you select it out, like...
SELECT
    APILoadTime
    ,UploadId
    ,ROW_NUMBER() OVER (PARTITION BY UploadId ORDER BY APILoadTime) AS RowId
FROM dbo.MyTable;
Thanks a lot @Leon Yue and @JeffRamos. I've managed to figure out a solution, so I'm posting it here for everyone else who might encounter the same situation.
The solution I found was to use a stored procedure within Azure Data Factory, from where I call the Data Flow activity. This is the code I used to create the procedure that reseeds the RowId:
CREATE PROCEDURE resetRowId
AS
BEGIN
    -- DBCC CHECKIDENT reseeds a table's identity column, so the argument is
    -- the sink table's name (illustrative here), not the database name
    DBCC CHECKIDENT ('dbo.MyTable', RESEED, 0);
END
GO
Once I have this stored procedure, all I did was call it from the pipeline before the new data is loaded.
This does it for you; the reason I kept the reseed value at 0 is so that when new data comes in, the RowId starts from 1 again.
Hope this helps others too.
Thank you to all who helped in some way.

How can I generate a matrix based on two columns from two tables?

I'm working on creating an authorization matrix to hand out to my clients. My Excel workbook contains two tables: Applications and Permissions/Roles.
I'd like to take the first column of each table and dynamically generate an X,Y matrix on another worksheet, where my client can mark the combinations of application/role that are required.
So far, I've tried a pivot table, but those cannot be edited. I'd like to stay away from macros, since this will be given out to external clients.
In the end, I'd like to get a dynamically generated matrix that looks like the following:
|       | Role 1 | Role 2 | Role 3 | Role 4 | ... | Role n |
|-------|--------|--------|--------|--------|-----|--------|
| App 1 |        |        |        |        |     |        |
| App 2 |        |        |        |        |     |        |
| App 3 |        |        |        |        |     |        |
| App 4 |        |        |        |        |     |        |
| ...   |        |        |        |        |     |        |
| App n |        |        |        |        |     |        |
Any ideas?
If the result described below is what you're looking for, then you can use this method and then copy and paste as values on the sheet that you send to the client.
You need to manually create a unique list of Apps.
Next to each App there will be several array formulas to extract the associated roles.
=IFERROR(INDEX($B$3:$C$14,SMALL(IF(($B$3:$B$14=$C17),ROW($B$3:$C$14)-2),ROW($1:$1)),2),"")
If your data is in an Excel Table, then the formula is easier to create and read.
=IFERROR(INDEX(Table1,SMALL(IF((Table1[App Name]=$C16),ROW(Table1)-2),ROW($1:$1)),2),"")
This is an array formula so you need to use Ctrl + Shift + Enter to submit the formula.
The formula can be auto-filled across, but you will get the same value under each role. You need to change ROW($1:$1) to ROW($2:$2) under Role 2, ROW($3:$3) under Role 3, etc., for as many roles as you think you will need, remembering to use Ctrl + Shift + Enter after each change.
Then simply fill down and it will populate all the roles for each App in your list.
Save a copy as a template and use copy and paste as values to remove all the scary formulas before you send it to your client.

Better way to refresh imported columns?

I have a table in Spotfire with a couple of columns imported from another table as a lookup. As an example, Col2 is used as the match column for the import of ImportedCol:
+------+------+-------------+
| Col1 | Col2 | ImportedCol |
+------+------+-------------+
| 1 | A | Val1 |
| 2 | B | Val2 |
| 3 | A | Val1 |
| 4 | C | Val3 |
| 5 | B | Val2 |
| 6 | A | Val1 |
| 7 | D | Val4 |
+------+------+-------------+
However, the data in Col2 is subject to change. In that event, I need ImportedCol to change with it; however, Spotfire seems to just keep the old imported data. Right now I've been deleting the imported column and re-adding it to refresh the link. Is there a way to dynamically import the data as the document loads, or with any refresh of the information links?
I have found that this happens sometimes, although I'm not exactly sure how to explain why. My workaround is to create "virtual" data tables based on your existing ones.
Consider your linked table as A and your embedded table as B. Start from a default state -- that is, before importing any columns.
Add a new data table. The source for this table should be "From Current Analysis", using A. We will consider this one as C; it becomes your main data table, and C will update when any changes are made to A or B.
I found the issue.
It turns out that pivoting on data in the same table creates a circular reference, which overrides the embed/link setting on that table. My workaround was to make the pivot its own information link, then have the table join the original link and the new pivot one.

SpecFlow - Is it possible to reuse test data within feature file?

Is there any way to reuse data in SpecFlow feature files?
E.g. I have two scenarios, which both uses the same data table:
Scenario: Some scenario 1
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
When ...
Scenario: Some scenario 2
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
In these simple examples the tables are small, so they're not a big problem; however, in my case the tables have 20+ rows and will be used in at least 5 tests each.
I'd imagine something like this:
Having data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
Given I have a data table "Employee"
When ...
Scenario: Some scenario 2
Given I have a data table "Employee"
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
I couldn't find anything like this in the SpecFlow documentation. The only suggestion for sharing data was to put it into *.cs files. However, I can't do that because the feature files will be used by non-technical people.
The Background is the place for common data like this, until the data gets too large and your Background section ends up spanning several pages. It sounds like that might be the case for you.
You mention the tables having 20+ rows each and having several data tables like this. That would be a lot of Background for readers to wade through before they get to the Scenarios. Is there another way you could describe the data? When I had tables of data like this in the past, I put the details into a fixtures class in the automation code and then described just the important aspects in the Feature file.
Assuming, for the sake of an example, that "Tom" is a potential car buyer and you're running some sort of car showroom, his data table might include:
| Field | Value |
| Name | Tom |
| Age | 16 |
| Address | .... |
| Phone Number | .... |
| Fav Colour | Red |
| Country | UK |
Your Scenario 2 might be "Under 18s shouldn't be able to buy a car" (in the UK at least). Given that scenario, we don't care about Tom's address or phone number, only his age. We could write that scenario as:
Scenario: Under 18s shouldn't be able to buy a car
Given there is a customer "Tom" who is under 18
When he tries to buy a car
Then I should politely refuse
Instead of keeping that table of Tom's details in the Feature file, we just reference the significant parts. When the Given step runs, the automation can look up "Tom" from our fixtures. The step references his age so that a) it's clear to the reader of the Feature file who Tom is, and b) we make sure the fixture data is still valid.
A reader of that scenario will immediately understand what's important about Tom (he's 16), and they don't have to continuously cross-reference between the Scenario and the Background. Other Scenarios can also use Tom, and if they are interested in other aspects of his information (e.g. his address) then they can specify the relevant information: Given there is a customer "Tom" who lives at 10 Downing Street.
Which approach is best depends on how much of this data you've got. If it's a small number of fields across a couple of tables, then put it in the Background; but once it gets to 10+ fields or large numbers of tables (presumably you have many potential customers), then I'd suggest moving it outside the Feature file and just describing the relevant information in each Scenario.
Yes, you use a background, i.e. from https://github.com/cucumber/cucumber/wiki/Background
Background:
Given I have a data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
When ...
Scenario: Some scenario 2
Given I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
If you're ever unsure, I find http://www.specflow.org/documentation/Using-Gherkin-Language-in-SpecFlow/ a great resource.
