Apologies if this has been asked and answered elsewhere; if it has, please point me to the URL in the comments. Here is the situation:
I am making an API request; in the response I get an auth_token, which I use as the Authorization header in a Copy Activity to retrieve data in JSON format and sink it into an Azure SQL Database. I am able to map all the elements I receive in the JSON to the columns of the Azure SQL Database. However, there are two columns (UploadId and RowId) that still need to be populated.
UploadId is a GUID which will be the same for the whole batch of rows (this part I have managed to solve).
RowId will be a sequence starting from 1 and running to the end of that batch; then for the next batch (with a new GUID value) it resets back to 1.
The database will look something like this:
| APILoadTime | UploadId          | RowId |
|-------------|-------------------|-------|
| 2020-02-01  | 29AD7-12345-22EwQ | 1     |
| 2020-02-01  | 29AD7-12345-22EwQ | 2     |
| 2020-02-01  | 29AD7-12345-22EwQ | 3     |
| 2020-02-01  | 29AD7-12345-22EwQ | 4     |
| 2020-02-01  | 29AD7-12345-22EwQ | 5     |
--------------------------------------------------> End of Batch One / Start of Batch Two
| 2020-02-01  | 30AD7-12345-22MLK | 1     |
| 2020-02-01  | 30AD7-12345-22MLK | 2     |
| 2020-02-01  | 30AD7-12345-22MLK | 3     |
| 2020-02-01  | 30AD7-12345-22MLK | 4     |
| 2020-02-01  | 30AD7-12345-22MLK | 5     |
--------------------------------------------------> End of Batch Two, and so on ...
Is there a way to achieve this RowId behavior in Azure Data Factory's Copy Activity ... or, failing that, is it possible within Azure SQL Database itself?
Apologies for the long description, and thank you in advance for any help!
Regards
You need to use a window function to achieve this. ADF Data Flows have a Window transformation that is designed to do exactly this.
Otherwise, you could load the data into a staging table and then use Azure SQL to window the data as you select it out, like...
SELECT
APILoadTime
,UploadId
,ROW_NUMBER() OVER (PARTITION BY UploadId ORDER BY APILoadTime) AS RowId
FROM dbo.MyTable;
Thanks a lot @Leon Yue and @JeffRamos. I've managed to figure out the solution, so I'm posting it here for everyone else who might encounter the same situation.
The solution I found was to use a Stored Procedure activity within Azure Data Factory, in the same pipeline from which I call the Azure Data Flow activity. This is the code I used to create the RowId reseed procedure:
CREATE PROCEDURE resetRowId
AS
BEGIN
    -- DBCC CHECKIDENT expects a table name (not a database name);
    -- reseeding the identity to 0 makes the next inserted RowId start at 1
    DBCC CHECKIDENT ('dbo.MyTable', RESEED, 0);
END
GO
Once I have this Stored Procedure, all I did was call it from the pipeline before the Data Flow activity runs.
This does it for you; the reason I kept the reseed value at 0 is so that when new data comes in, the RowId starts from 1 again ...
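Note that DBCC CHECKIDENT only affects an IDENTITY column, so this approach assumes RowId is defined as one on the target table. A minimal sketch of such a table definition (the names here are illustrative, not taken from the actual solution):
-- Illustrative only: RowId as an IDENTITY column, so that after
-- DBCC CHECKIDENT (..., RESEED, 0) the next inserted row gets RowId = 1
CREATE TABLE dbo.MyTable
(
    APILoadTime DATE              NOT NULL,
    UploadId    NVARCHAR(50)      NOT NULL,
    RowId       INT IDENTITY(1,1) NOT NULL
);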
Hope this helps others too ...
Thank you to everyone who helped in some way
I have a "master" pipeline in Azure Data factory, which looks like this:
One rectangle is Execute pipeline activity for 1 destination (target) Table, so this "child" pipeline takes some data, transform it and save as a specified table. Essentialy this means that before filling table on the right, we have to fill previous (connected with line) tables.
The problem is that this master pipeline contains more than 100 activities and the limit for data factory pipeline is 40 activities.
I was thinking about dividing pipeline into several smaller pipelines (i.e. first layer (3 rectangles on the left), then second layer etc.), however this could cause pipeline to run a lot longer as there could be some large table in each layer.
How to approach this? What is the best practice here?
I had a similar issue at work, but I didn't use Execute Pipeline because it was a terrible approach in my case. I have more than 800 PLs to run, with multiple parent and child dependencies that can go multiple levels deep depending on the complexity of the data, plus several restrictions (starting with transforming data for 9 regions in the US while reusing PLs). A simplified diagram of one of the many cases I have can easily look like this:
The solution:
A master dependency table in which to store all the dependencies:
| Job ID | dependency ID | level | PL_name |
|--------|---------------|-------|--------------|
| Token1 | | 0 | |
| L1Job1 | Token1 | 1 | my_PL_name_1 |
| L1Job2 | Token1 | 1 | my_PL_name_2 |
| L2Job1 | L1Job1,L1Job2 | 2 | my_PL_name_3 |
| ... | ... | ... | ... |
From here it is a tree problem:
There are ways of mapping trees in SQL (a sketch follows the tracker table below). Once you have all the dependencies mapped from the tree, put them in a stage or tracker table:
| Job ID | dependency ID | level | status | start_date | end_date |
|--------|---------------|-------|-----------|------------|----------|
| Token1 | | 0 | | | |
| L1Job1 | Token1 | 1 | Running | | |
| L1Job2 | Token1 | 1 | Succeeded | | |
| L2Job1 | L1Job1,L1Job2 | 2     |           |            |          |
| ... | ... | ... | ... | ... | ... |
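As a rough sketch of the tree-mapping step (the table and column names here are assumptions, e.g. dbo.JobDependency holding one row per job/parent pair), a recursive CTE can derive the level column from the raw dependency rows; a comma-separated dependency ID column would first need to be split, for example with STRING_SPLIT, and a multi-parent job would take MAX(level) + 1:
WITH JobTree AS
(
    -- anchor: the root token has no parent
    SELECT [Job ID], [dependency ID], 0 AS [level]
    FROM dbo.JobDependency
    WHERE [dependency ID] IS NULL

    UNION ALL

    -- recurse: a child sits one level below its parent
    SELECT c.[Job ID], c.[dependency ID], p.[level] + 1
    FROM dbo.JobDependency AS c
    INNER JOIN JobTree AS p
        ON c.[dependency ID] = p.[Job ID]
)
SELECT [Job ID], [dependency ID], [level]
FROM JobTree;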
We can easily query this table using a Lookup activity to get the level 1 PLs to run, and use a ForEach activity to trigger each target PL with a dynamic Web activity. Then update the tracker table's status, start_date, end_date, etc. accordingly per PL.
There are only two orchestrating PLs:
one for mapping the tree and assigning some type of unique ID to that batch
two for validation (it verifies the status of the parent PLs and controls which PL to run next)
Note: both call a stored procedure with some logic depending on the case.
I have a recursive call to the validation PL each time a target pipeline ends:
Let's assume L1Job1 and L1Job2 are running in parallel:
L1Job1 ends successfully -> calls the validation PL -> validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.
If L1Job2 hasn't ended yet, the validation PL ends without triggering L2Job1.
Then L1Job2 ends successfully -> calls the validation PL -> validation now triggers L2Job1, since both L1Job1 and L1Job2 have a Succeeded status.
L2Job1 starts running after passing the validations.
Repeat for each level.
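The check inside the validation PL's stored procedure could look something like this sketch (table and column names are assumptions based on the tracker above): it returns the jobs that have not started yet and whose parents have all reached a Succeeded status.
-- Sketch only: jobs in the tracker that are ready to run
SELECT t.[Job ID]
FROM dbo.JobTracker AS t
CROSS APPLY STRING_SPLIT(t.[dependency ID], ',') AS d        -- one row per parent job
INNER JOIN dbo.JobTracker AS p
    ON p.[Job ID] = LTRIM(RTRIM(d.value))
WHERE t.[status] IS NULL                                     -- not started yet
GROUP BY t.[Job ID]
HAVING COUNT(*) = SUM(CASE WHEN p.[status] = 'Succeeded' THEN 1 ELSE 0 END);  -- every parent succeeded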
This works because we already mapped all the PL dependencies in the job tracker and we know exactly which PLs should run.
I know this looks complicated and may not apply to your case, but I hope it gives you or others a clue on how to solve complex data workflows in Azure Data Factory.
Yes, as per the documentation, the maximum number of activities per pipeline (which includes inner activities for containers) is 40.
So the only option left is splitting your pipeline into multiple smaller pipelines.
Please check the link below for the limits on ADF:
https://github.com/MicrosoftDocs/azure-docs/blob/master/includes/azure-data-factory-limits.md
In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)
After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I aliased that DataFrame as an SQL table:
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
With that and the DataFrame API, I can count the call types that occurred:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
... or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
| 33|
+------------------------+
... or with an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)
To answer the question
Can Spark SQL not count correctly or can I not write SQL correctly?
from the title: I can't write SQL correctly.
Rule <insert number> of writing SQL: Think about NULL and UNDEFINED.
%sql
SELECT count(*)
FROM (
SELECT DISTINCT CallType
FROM fireServiceCalls
)
34
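As a minimal illustration of the difference (same table and column as above): COUNT(DISTINCT ...) silently skips NULLs, so it comes out one short of the number of distinct groups whenever the column contains a NULL.
%sql
SELECT
  count(DISTINCT CallType) AS distinct_ignoring_null,                                       -- 33
  count(DISTINCT CallType)
    + max(CASE WHEN CallType IS NULL THEN 1 ELSE 0 END) AS distinct_counting_null_group     -- 34
FROM fireServiceCalls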
Also, I apparently can't read:
pault suggested in a comment
With only 30 something values, you could just sort and print all the distinct items to see where the difference is.
Well, I actually thought of that myself (minus the sorting). Except there wasn't any difference: there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:
+--------------------------------------------+
|CallType |
+--------------------------------------------+
|Elevator / Escalator Rescue |
|Marine Fire |
|Aircraft Emergency |
|Confined Space / Structure Collapse |
|Administrative |
|Alarms |
|Odor (Strange / Unknown) |
|Lightning Strike (Investigation) |
|null |
|Citizen Assist / Service Call |
|HazMat |
|Watercraft in Distress |
|Explosion |
|Oil Spill |
|Vehicle Fire |
|Suspicious Package |
|Train / Rail Fire |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other |
|Transfer |
|Outside Fire |
|Traffic Collision |
|Assist Police |
|Gas Leak (Natural and LP Gases) |
|Water Rescue |
|Electrical Hazard |
|High Angle Rescue |
|Structure Fire |
|Industrial Accidents |
|Medical Incident |
|Mutual Aid / Assist Outside Agency |
|Fuel Spill |
|Smoke Investigation (Outside) |
|Train / Rail Incident |
+--------------------------------------------+
I have two Azure storage tables:
Table 1: This table is being updated regularly with a background task.
Table 2: Contains a subset of the entities of table 1, but it needs to be updated whenever there is a change in table 1.
| Table 2 |     | Table 1 |
|---------|-----|---------|
|         |     | A       |
| B       |     | B       |
| C       | <=> | C       |
| D       |     | D       |
|         |     | E       |
|         |     | F       |
Basically, what I want to achieve here is that table 1 should always be listening to table 2: whenever I add an entity to table 2, table 1 should know that I am interested in tracking that item, and both entities should be updated whenever an update is available.
Well, here is a suggestion:
Have the background task send a message to a Service Bus Queue containing a reference to the changed entity
Make an Azure Function that listens to the queue, and it can check if an entity matching the updated one exists in the other table
If one does, it can update it
More info: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-service-bus
Support for these events might be added to Azure Event Grid in the future.
When that happens, it would remove the need to post events about table changes yourself, and you would only need the Event Grid subscription + Azure Function.
Considering I have Core Data objects stored like this:
|Name | ActionType | Content | Date |
|-----|------------|---------|-----------|
|Abe | Create | "Hello" | 2014-07-01|
|Cat | Create | "Well" | 2014-07-01|
|Abe | Create | "Hi" | 2014-07-02|
|Bob | Edit | "Yo" | 2014-07-03|
|Cat | Delete | "What" | 2014-07-04|
|Abe | Edit | "Haha" | 2014-07-05|
I would like to get the last action of each user, so the results would be
|Abe | Edit | "Haha" | 2014-07-05|
|Cat | Delete | "What" | 2014-07-04|
|Bob | Edit | "Yo" | 2014-07-03|
Does anyone know how to do that with an NSFetchRequest? So far, from what I've gathered, if you want to use "group by", you can only retrieve the values in the group-by clause (it will return "Abe, Cat, Bob" without the rest of the data in the Core Data object). Similarly with "returnsDistinctResults", it will not return the whole object.
I have a feeling that Core Data is not equipped for this; any help and hints would be appreciated!
Core Data is an object graph, not a database. Core Data itself has no concept of uniqueness, it's up to you to implement that in your application. This is most typically done using the find or create pattern. This pattern helps you prevent duplicate objects from being stored.
That said, you CAN return distinct results from Core Data using the NSDictionaryResultType. This will not prevent duplicates from being stored, but can be used to return distinct results from a fetch. There is an example of this in the programming guide. You can give this request all properties for a given entity by working with the NSEntityDescription of the managed object you are fetching.
For getting the object with the "last" timestamp for each, you actually want to get the object with the maximum value for that key path. That can be done a number of ways - a subquery, key path operators, expressions, etc.
Is there any way to reuse data in SpecFlow feature files?
E.g. I have two scenarios, which both use the same data table:
Scenario: Some scenario 1
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
When ...
Scenario: Some scenario 2
Given I have a data table
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
In these simple examples the tables are small, so they're not a big problem; however, in my case the tables have 20+ rows and will be used in at least 5 tests each.
I'd imagine something like this:
Having data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
Given I have a data table "Employee"
When ...
Scenario: Some scenario 2
Given I have a data table "Employee"
And I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
When ...
I couldn't find anything like this in the SpecFlow documentation. The only suggestion for sharing data was to put it into *.cs files. However, I can't do that because the Feature Files will be used by non-technical people.
The Background is the place for common data like this until the data gets too large and your Background section ends up spanning several pages. It sounds like that might be the case for you.
You mention the tables having 20+ rows each and there being several data tables like this. That would be a lot of Background for readers to wade through before they get to the Scenarios. Is there another way you could describe the data? When I had tables of data like this in the past, I put the details into a fixtures class in the automation code and then described just the important aspects in the Feature file.
Assuming for the sake of an example that "Tom" is a potential car buyer and you're running some sort of car showroom then his data table might include:
| Field | Value |
| Name | Tom |
| Age | 16 |
| Address | .... |
| Phone Number | .... |
| Fav Colour | Red |
| Country | UK |
Your Scenario 2 might be "Under 18s shouldn't be able to buy a car" (in the UK at least). Given that scenario, we don't care about Tom's address or phone number, only his age. We could write that scenario as:
Scenario: Under 18s shouldn't be able to buy a car
Given there is a customer "Tom" who is under 18
When he tries to buy a car
Then I should politely refuse
Instead of keeping that table of Tom's details in the Feature file, we just reference the significant parts. When the Given step runs, the automation can look up "Tom" from our fixtures. The step references his age so that a) it's clear to the reader of the Feature file who Tom is, and b) we make sure the fixture data is still valid.
A reader of that scenario will immediately understand what's important about Tom (he's 16), and they don't have to continuously cross-reference between the Scenario and the Background. Other Scenarios can also use Tom, and if they are interested in other aspects of his information (e.g. his address) then they can specify the relevant information: Given there is a customer "Tom" who lives at 10 Downing Street.
Which approach is best depends how much of this data you've got. If it's a small number of fields across a couple of tables then put it in the Background, but once it gets to be 10+ fields or large numbers of tables (presumably we have many potential customers) then I'd suggest moving it outside the Feature file and just describing the relevant information in each Scenario.
Yes, you use a Background; see https://github.com/cucumber/cucumber/wiki/Background
Background:
Given I have a data table "Employee"
| Field Name | Value |
| Name | "Tom" |
| Age | 16 |
Scenario: Some scenario 1
When ...
Scenario: Some scenario 2
Given I have another data table
| Field Name | Value |
| Brand | "Volvo" |
| City | "London" |
If you're ever not sure, I find http://www.specflow.org/documentation/Using-Gherkin-Language-in-SpecFlow/ a great resource.