Dynamic REST calls in Azure Synapse Pipeline

I am making a call to a REST API with Azure Synapse and the return dataset looks something like this:
{
    "links": [
        {
            "rel": "next",
            "href": "[myRESTendpoint]?limit=1000&offset=1000"
        },
        {
            "rel": "last",
            "href": "[myRESTendpoint]?limit=1000&offset=60000"
        },
        {
            "rel": "self",
            "href": "[myRESTendpoint]"
        }
    ],
    "count": 1000,
    "hasMore": true,
    "items": [
        {
            "links": [],
            "closedate": "6/16/2014",
            "id": "16917",
            "number": "62000",
            "status": "H",
            "tranid": "0062000"
        },...
    ],
    "offset": 0,
    "totalResults": 60316
}
I am familiar with making a REST call to a single endpoint that can return all the data in a single call using a Synapse pipeline, but this particular REST endpoint has a hard limit of 1000 records per call. It does, however, return a property named "hasMore".
Is there a way to make REST calls repeatedly in a Synapse pipeline until the "hasMore" property equals false?
The end goal of this is to sink data to either a dedicated SQL pool or into ADLS2 and transform from there.

I have tried the same scenario using Azure Data Factory, which seems well suited to the stated goal of sinking the data to either a dedicated SQL pool or ADLS2 and transforming it from there.
Since you have to hit the endpoint repeatedly to fetch 1000 records at a time, you can configure the REST connector's pagination rules if the response headers or response body contain the URL for the next page.
If the next-page link or query parameter is not included in the response headers or body, this built-in pagination will not help you.
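In your sample the "next" link does appear in the body, so a minimal sketch of a pagination rule on the copy source could look like this (assuming the next link stays the first entry of the links array; if the connector cannot address that path, fall back to the loop approach below):
"source": {
    "type": "RestSource",
    "paginationRules": {
        "AbsoluteUrl": "$.links[0].href"
    }
}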
Alternatively, you can build the loop logic yourself around the Copy activity:
Create two parameters on the REST connector (dataset), e.g. limit and offset.
Use those parameters to build the REST connector's relative URL.
Using a Set Variable activity, increase the offset value on each pass of the loop so that the URL for the Copy activity is set dynamically on every iteration. For the looping itself you can use the Until activity, as sketched below.
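A minimal sketch of those pieces, with placeholder parameter and variable names (limit, offset, offsetValue and hasMorePages are assumptions, not fixed ADF names):
Relative URL on the REST dataset, built from the two dataset parameters:
@concat('?limit=', dataset().limit, '&offset=', dataset().offset)
Until activity expression, assuming something inside the loop (for example a Lookup against the same endpoint) copies the response's hasMore into a pipeline variable:
@equals(variables('hasMorePages'), false)
Set Variable to advance the offset on each pass, written to a helper variable and then copied back, since a variable normally cannot reference itself in its own Set Variable expression:
@string(add(int(variables('offsetValue')), 1000))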
Alternative:
In my experience, the REST connector's pagination is quite rigid, so I usually put the Copy activity inside a ForEach loop to have more control.

For those following the thread, I used IpsitaDash-MT's suggestion of the ForEach loop. In the case of this API, each call returns a property named "totalResults" at the end of the response. Here are the steps I used to achieve what I was looking to do:
1. Make a dummy call to the API to get the "totalResults" property. This is just a call to return the number of results I am expecting. In the case of this API, the body of the request is a SQL statement, so for the dummy request I only ask for the IDs of the results I am looking to get.
SQL statement example
2. I then take the "totalResults" property from that request and set a dynamic value in the "Items" of the ForEach loop like this:
@range(0,add(div(sub(int(activity('Get Pages Customers').output.totalResults),mod(int(activity('Get Pages Customers').output.totalResults),1000)),1000),1))
NOTE: The API only allows pages of 1000 results, so I do some math to turn the total into a range of page numbers, adding 1 to the final result to include the last page.
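For example, with the totalResults of 60316 from the sample response above: 60316 - mod(60316, 1000) = 60000, divided by 1000 gives 60, plus 1 gives 61, so range(0, 61) produces page indexes 0 through 60, and the final offset of 60 * 1000 = 60000 matches the "last" link in the response.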
ForEach Loop Settings
3. The API accepts two parameters, "limit" and "offset". Since I want all of the data, there is no reason to set limit to anything other than 1000 (the maximum allowed). The offset parameter can be any number from 0 up to "totalResults" - "limit", so I take the range established in step 2 and multiply each page index by 1000 to set the offset parameter in the URL.
Setting the offset parameter in the copy data activity
Dynamic value of the Relative URL in the REST connector
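As a rough sketch (using the endpoint placeholder from the sample response; in practice the base URL lives on the linked service and only the relative part goes here), the dynamic relative URL ends up along these lines, with the page index coming from the ForEach item:
@concat('[myRESTendpoint]?limit=1000&offset=', string(mul(item(), 1000)))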
NOTE: I found it better to sink the data as JSON into ADLS2 first rather than into a dedicated SQL pool due to the Lookup feature.
4. Since Synapse does not allow nested ForEach loops, I run the data through a data flow to format it and check for duplicates and updates.
5. When the data flow completes, it kicks off a Lookup activity to get the data that was just processed and passes it into a new pipeline, which uses another ForEach loop to get the child data for each parent ID.
Data Flow and Lookup for child data pipeline
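As a rough sketch of that handoff (the activity and parameter names here are placeholders): the Execute Pipeline activity passes the Lookup output to the child pipeline as an array parameter, e.g. @activity('Lookup Processed').output.value, and the child pipeline's ForEach then iterates @pipeline().parameters.parentItems.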

Related

Get Nested Output Dynamically in Azure Data Factory

I want to build an expression in ADF via concatenation, then evaluate the nested expression.
Basically, I have a Web Activity which is returning json output. I need to access an element of the output that has multiple possible keys, and can be nested at multiple levels. I want to use pipeline parameters to access my desired element regardless of the key or level it resides at.
Here is a sample input:
{
    "status": "OK",
    "code": 200,
    "timestamp": "2020-11-02T15:22:59Z",
    "messages": [],
    "result": {},
    "paging": { "total_count": 1000 }
}
I can grab the desired output statically like this:
@{activity('callAPI').output['paging']['total_count']}
I can also generate the above expression dynamically like this:
@{concat('activity(''callAPI'').output', pipeline().parameters.myPipelineParam)}
However, once I create the expression via concatenation, I can't figure out how to also evaluate it in the same expression.
Any ideas on how to do this, or perhaps a better method I'm not seeing?

Azure Data Factory V2 Dynamic Content

Long story short, I have a data dump that is too large for an Azure Function, so we are using Data Factory.
I have tasked another function with generating an access token for an API and outputting it as part of a JSON response. I would like to assign that token to a variable within the pipeline. So far I have this:
I'm attempting to use the Dynamic Content "language" to set the variable:
@activity('Get_Token').output
I'd like something like Python's:
token = data.get('data', {}).get('access_token', '')
As a secondary question, my next step is to use this token to call an API while iterating over another output, so perhaps this exact step can be added into the ForEach?
Looks like the variable should be @activity('Get token').output.data.access_token as others have indicated, but, as you've guessed, there's no need to assign a variable if you only need it within the foreach. You can access any predecessor's output from a successor activity. Here's how to use the token while iterating over another output:
1. Let's say your function also outputs listOfThings as an array within the data key. Then you can set the foreach activity to iterate over @activity('Get token').output.data.listOfThings.
2. Inside the foreach you will have (let's say) a Copy activity with a REST dataset as the source. Configure the REST linked service with anonymous auth ...
3. ... then you'll find a field called Additional Headers in the REST dataset where you can create a key Authorization with the value as above, Basic @activity('Get token').output.data.access_token.
4. The thing that you said you want to iterate over (in the listOfThings JSON array) can be referenced inside the foreach activity with @item() (or, if it's a member of an item in the listOfThings iterable, then it would be @item().myMember).
To make #4 explicit for anyone else arriving here:
If listOfThings looks like this: listOfThings: [ "thing1", "thing2", ...]
for example, filenames: ["file1.txt", "file2.txt", ...]
then @item() becomes file1.txt etc.,
whereas
if listOfThings looks like this: listOfThings: [ {"key1":"value1", "key2":"value2", ... }, {"key1":"value1", "key2":"value2", ... }, ...]
for example, filenames: [ {"folder":"folder1", "filename":"file1.txt"}, {"folder":"folder2", "filename":"file2.txt"}, ... ]
then @item().filename becomes file1.txt etc.
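For reference, a sketch of the Get token output shape these expressions assume (all values here are placeholders):
{
    "data": {
        "access_token": "<token>",
        "listOfThings": [
            { "folder": "folder1", "filename": "file1.txt" },
            { "folder": "folder2", "filename": "file2.txt" }
        ]
    }
}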

Azure Data Factory foreach activity step size support

I have a pipeline that takes a list of IDs as input, and I need to iterate through these IDs and call a REST API in batches of 10 IDs at a time (the IDs are passed as a parameter in the JSON request).
1) Is there any way to pass a step size to the ForEach activity in Data Factory?
2) Do you have any other suggestions for how to accomplish this?
I have tried the ForEach loop and also thought about using the Set Variable and Append Variable activities to store the current index during the loop, but I couldn't find a way to get the current index inside the ForEach.
You should use a Lookup activity. With that you can get information from a database, files or whatever, and then pass it to a ForEach loop.
Consider I have the following information in my txt file:
name|age
orochiBrabo|25
NarutoBoy|98
You can read it using a Lookup activity, which I will call MyLookUp, and then connect its output to a ForEach activity.
In the ForEach activity's Settings tab you write @activity('MyLookUp').output.value, and now you can iterate over all rows in the file. Inside your ForEach you can refer to the results as @item().age, @item().name or @item().myColumnName.
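If you also need the batch-of-10 behaviour from the original question, one rough sketch (not part of the answer above, and assuming the Lookup is still called MyLookUp) is to iterate over batch indexes instead of rows: set the ForEach items to @range(0, div(add(length(activity('MyLookUp').output.value), 9), 10)) and then pick out each slice of 10 inside the loop with @take(skip(activity('MyLookUp').output.value, mul(item(), 10)), 10).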

Indexing arrays in CosmosDB

Why doesn't CosmosDB index arrays by default? The default index path is
"path": "/*"
Doesn't that mean "index everything"? Not "index everything except arrays".
If I add my array field to the index with something like this:
"path": "/tags/[]/?"
It will work and start indexing that particular array field.
But my question is why doesn't "index everything" index everything?
EDIT: Here's a blog post that describes the behavior I'm seeing. http://www.devwithadam.com/2017/08/querying-for-items-in-array-in-cosmosdb.html Array_Contains queries are very slow, clearly not using the index. If you add the field in question to the index explicitly then the queries are fast (clearly they start using the index).
"New" index layout
As stated in Index Types
Azure Cosmos containers support a new index layout that no longer uses the Hash index kind. If you specify a Hash index kind on the indexing policy, the CRUD requests on the container will silently ignore the index kind and the response from the container only contains the Range index kind. All new Cosmos containers use the new index layout by default.
The below issue does not apply to the new index layout. There the default indexing policy works fine (and delivers the results in 36.55 RUs). However pre-existing collections may still be using the old layout.
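For reference, the default indexing policy on a new container is roughly the following, with a single range index over every path rather than per-path index kinds:
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        { "path": "/*" }
    ],
    "excludedPaths": [
        { "path": "/\"_etag\"/?" }
    ]
}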
"Old" index layout
I was able to reproduce the issue with ARRAY_CONTAINS that you are asking about.
Setting up a CosmosDB collection with 100,000 posts from the SO data dump (e.g. this question would be represented as below)
{
    "id": "50614926",
    "title": "Indexing arrays in CosmosDB",
    /* Other irrelevant properties omitted */
    "tags": [
        "azure",
        "azure-cosmosdb"
    ]
}
And then performing the following query
SELECT COUNT(1)
FROM t IN c.tags
WHERE t = 'sql-server'
The query took over 2,000 RUs with default indexing policy and 93 with the following addition (as shown in your linked article)
{
    "path": "/tags/[]/?",
    "indexes": [
        {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
        }
    ]
}
However what you are seeing here is not that the array values aren't being indexed by default. It is just that the default range index is not useful for your query.
The range index uses keys based on partial forward paths, so it will contain paths such as the following.
tags/0/azure
tags/0/c#
tags/0/oracle
tags/0/sql-server
tags/1/azure-cosmosdb
tags/1/c#
tags/1/sql-server
With this index structure it starts at tags/0/sql-server and then reads all of the remaining tags/0/ entries and the entirety of the entries for tags/n/ where n is an integer greater than 0. Each distinct document mapping to any of these needs to be retrieved and evaluated.
By contrast the hash index uses reverse paths (more details - PDF)
StackOverflow theoretically allows a maximum of 5 tags per question to be added through the UI, so in this case (ignoring the fact that a few questions have more tags through site admin activities) the reverse paths of interest are
sql-server/0/tags
sql-server/1/tags
sql-server/2/tags
sql-server/3/tags
sql-server/4/tags
With the reverse path structure, finding all paths with leaf nodes of value sql-server is straightforward.
In this specific use case, because the arrays are bounded to a maximum of 5 possible values, it is also possible to use the original range index efficiently by looking at just those specific paths.
The following query took 97 RUs with default indexing policy in my test collection.
SELECT COUNT(1)
FROM c
WHERE 'sql-server' IN (c.tags[0], c.tags[1], c.tags[2], c.tags[3], c.tags[4])
Cosmos DB does index all the elements of an array. By default, all Azure Cosmos DB data is indexed. Read more here: https://learn.microsoft.com/en-us/azure/cosmos-db/indexing-policies

Populate Azure Data Factory dataset from query

Cannot find an answer via Google, MSDN (or other Microsoft documentation), or SO.
In Azure Data Factory you can get data from a dataset by using copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in documentation are simple, single table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName"= "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex sql.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "tablename" property.
If there is a way, what would that method be?
input is on-premises sql server. output is azure sql database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
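A simplified sketch of that kind of source definition (the join columns here are made up for illustration; only the table names come from the example above):
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "select infoNode.BuildId, infoType.Name from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on infoNode.PartitionId = infoType.PartitionId"
}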
There's a way to move data from on-premises SQL Server to Azure SQL using Data Factory.
You can use the Copy activity; check the code sample for your case specifically (the GitHub link to the ADF activity source).
Basically, you need to create a Copy activity whose typeProperties contain a SqlSource and a SqlSink, which look like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only selects from tables or views; table-valued functions will work as well.
