My needs are following:
- Need to fetch data from a 3rd party API into SQL azure.
The API's will be queried everyday for incremental data and may require pagination as by default any API response will give only Top N records.
The API also needs an auth token to work, which is the first call before we start downloading data from endpoints.
Due to last two reasons, I've opted for Function App which will be triggered daily rather than data factory which can query web APIs.
Is there a better way to do this?
Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the json data directly into a SQL database.
Azure Function
Azure functions are very cost effective. The first million execution are free. If it takes longer than ten, then have a look at durable functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar. Get the total number of pages from the API, set a variable to that value, and loop over the pages; Getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Protip: Make sure specify an upper bound for those loops. Also, if your API is flakey or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
Azure SQL Json Indexed Calculated Columns
You mentioned storing your data as json files into a storage container. Are you sure you need that? If so, then you could create an external table link between the storage container and the database. That has the advantage of not having the data take up any space in the database. However, if the json will fit in the database, I would highly recommend dropping that json right into the SQL database and leveraging indexed calculated columns to make querying the json extremely quick.
Using this pairing should provide incredible performance per penny value! Let us know what you end up using.
Maybe you can create a time task by SQL server Agent.
SQL server Agent--new job--Steps--new step:
In the Command, put in your Import JSON documents from Azure Blob Storage sql statemanets for example.
Schedules--new schedule:
Set Execution time.
But I think Azure function is better for you to do this.Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you could set the default top N values in your api, then you could use web activity in azure data factory to call your rest api to get the response data.Then configure the response data as input of copy activity(#activity('ActivityName').output) and the sql database as output. Please see this thread :Use output from Web Activity call as variable.
The web activity support authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then
parsing data from the JSON into SQL Azure. Any recommendations?
Well,if you could dump the data into blob storage,then azure stream analytics is the perfect choice for you.
You could run the daily job to select or parse the json data with asa sql ,then dump the data into sql database.Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing. If there is no ordering requirement, or if processing all records would take longer than the 10 minute function timeout. Or if you want to do some tweaking/transformation of the data in-flight, or if you have different destinations for different types of data. Or if you want to be insulated from a failure - e.g., your function fails halfway through processing and you don't want to re-query the API. Or you get data a different way and want to start processing at a specific step in the process (rather than running from the entry point). All sorts of reasons.
I'll caveat here to say that the best degree of parallelism vs complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps and combine them into a single one. Durable Functions also help make orchestration of this potentially easier.
A timer-driven function that queries the API to understand the depth of pages required, or queues up additional pages to a second function that actually makes the paged API call
That function then queries the API, and writes to a scratch area (like Blob) or drops each row into a queue to be written/processed (e.g., something like a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested (e.g., pub/sub)
If writing to scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, since a storage queue would be cheap and fast for something like this)
Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.
So I'm writing an app in Node.js which analyses sentiment and extracts key phrases using Microsoft Cognitive services. To do these two separate things, I use two nested POST calls, one starting when the other finishes. However, this seems inefficient. The documentation mentions that, to examine sentiment for example, the URL 'must contain /sentiment', not that it must be the only endpoint for that one call. Perhaps I could do both in one call and save myself Azure budget and execution time?
I have tried /sentiment/keyPhrases, that returns 404. /sentiment/../keyPhrases just does keyPhrases. I cannot think of another way to combine the two endpoints.
Turns out I was mislead by my understanding of the documentation. Endpoints must be called separately, and the answers can be combined in Node afterwards using response = response1.concat(response2);.
I've been working on a Xamarin.Forms application in Visual Studio using Azure for the backend for a while now, and I've come across a really strange issue.
Please note, that I am following the methods mentioned in this blog
For some strange reason the PullAsync() method seems to have some bizarre problems. Any data that I create and sync will only be pulled by PullAsync() from that solution. What I mean by that is that if I create another solution that accesses the exact same backend, it can perform it's own create/sync data, but will not bring over the data generated by the other solution, even though they both seem to have the exact same access. This appears to be some kind of a security feature/issue, but I can't quite make sense of it.
Has anyone else encountered this at all? Was there a work-around at all? This could potentially cause problems down the road if I were to ever want to create another solution that accesses the same system/data for whatever reason.
For some strange reason the PullAsync() method seems to have some bizarre problems. Any data that I create and sync will only be pulled by PullAsync() from that solution.
According to your provided tutorial, I found that the related PullAsync is using Incremental Sync.
await coffeeTable.PullAsync("allCoffees", coffeeTable.CreateQuery());
Incremental Sync:
the first parameter to the pull operation is a query name that is used only on the client. If you use a non-null query name, the Azure Mobile SDK performs an incremental sync. Each time a pull operation returns a set of results, the latest updatedAt timestamp from that result set is stored in the SDK local system tables. Subsequent pull operations retrieve only records after that timestamp.
Here is my test, you could refer to it for a better understanding of Incremental Sync:
Client : await todoTable.PullAsync("todoItems-02", todoTable.CreateQuery());
The client SDK would check if there has a record with the id equals deltaToken|{table-name}|{query-id} from the __config table of your SQLite local store.
If there has no record, then the SDK would send a request as following for pulling your records:
https://{your-mobileapp-name}.azurewebsites.net/tables/TodoItem?$filter=(updatedAt%20ge%20datetimeoffset'1970-01-01T00%3A00%3A00.0000000%2B00%3A00')&$orderby=updatedAt&$skip=0&$top=50&__includeDeleted=true
Note: the $filter would be set as (updatedAt ge datetimeoffset'1970-01-01T00:00:00.0000000+00:00')
While there has a record, then the SDK would pick up the value as the latest updatedAt timestamp and send the request as follows:
https://{your-mobileapp-name}.azurewebsites.net/tables/TodoItem?$filter=(updatedAt%20ge%20datetimeoffset'2017-06-26T02%3A44%3A25.3940000%2B00%3A00')&$orderby=updatedAt&$skip=0&$top=50&__includeDeleted=true
Per my understanding, if you handle the same logical query with the same query id (non-null) in different mobile client, you need to make sure the local db is newly created by each client. Also, if you want to opt out of incremental sync, pass null as the query ID. In this case, all records are retrieved on every call to PullAsync, which is potentially inefficient. For more details, you could refer to How offline synchronization works.
Additionally, you could leverage fiddler for capturing the network traces when you invoke the PullAsync, in order to troubleshoot your issue.
I have a class structure as following:
UserDepartments(1)->(n)Categories(1)->(n)Templates(1)->(n)reports
I am using Azure offline data sync with incremental sync. There are 2 major issues we are facing with this.
The code is here
Issues:
Is there any better way of downloading all this related content then doing foreach under foreach?
Intermittently we see that not all the content that has been changed on the server by another Web App downloads & syncs fine when incremental sync is on. Is there a way we can flush the cache list created by the key (the first parameter in PullAsync) used in Incremental Sync? Or do you see something we need to change in order to make sure that we download correct data on each sync?
Is there any better way of downloading all this related content then doing foreach under foreach?
Pull is performed on a per-table basis, we can’t downloading all the related content at once.
Is there a way we can flush the cache list created by the key (the first parameter in PullAsync) used in Incremental Sync?
Incremental sync is default support by PullAsync method if you pass the non-null value as the value of queryId parameter. But there are two points we need to pay attention to.
The queryId must be unique for difference pull method.
The field filter in later parameter must support sorting.
I am a newbie to silverlight & wcf ria service and am only finding examples where the dataset received from calling the domain service query is bound to a datagrid. I want to be able to access each record individually, without the dataset being bound to anything. How may I do that or does every data in silverlight need to be bound? Example: I call getlocations() and then need to iterate through each record and get the geographical points associated with each location.
Okay. I was trying to access the records before they were finished being retrieved (the async operation was not complete). So I transferred the function that uses the data to the event handler of the domain service load method, and then the function worked okay.