Bringing incremental data from REST APIs into Azure SQL

My needs are the following:
- Fetch data from a 3rd-party API into Azure SQL.
- The APIs will be queried every day for incremental data and may require pagination, since by default any API response returns only the top N records.
- The API also needs an auth token, which is the first call made before we start downloading data from the endpoints.
For the last two reasons, I've opted for a Function App triggered daily rather than Data Factory, which can also query web APIs.
Is there a better way to do this?
I am also thinking of pushing all the JSON into Blob storage and then parsing the data from the JSON into Azure SQL. Any recommendations?

How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the JSON data directly into a SQL database.
Azure Function
Azure Functions are very cost effective. The first million executions are free. If it takes longer than ten minutes, then have a look at Durable Functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar: get the total number of pages from the API, set a variable to that value, and loop over the pages, getting and saving your data in each iteration. If the API won't provide the maximum number of pages, then loop until you get an error. Pro tip: make sure to specify an upper bound for those loops. Also, if your API is flaky or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
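A minimal sketch of that loop in Python using Requests, assuming a hypothetical endpoint that reports total_pages and accepts a page query parameter (the URL, field names, and token handling are illustrative assumptions, not your actual API):

# Python sketch: bounded pagination with exponential backoff (illustrative endpoint and field names)
import time
import requests

BASE_URL = "https://api.example.com/v1/records"   # hypothetical 3rd-party endpoint
MAX_PAGES = 1000                                  # hard upper bound for the loop
MAX_RETRIES = 5

def fetch_page(session, page):
    """Fetch one page, retrying with exponential backoff on transient failures."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = session.get(BASE_URL, params={"page": page}, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(2 ** attempt)   # back off 1s, 2s, 4s, ...

def fetch_all(token):
    """Yield every record across all pages, never looping past MAX_PAGES."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {token}"   # auth token obtained in the first call
    first = fetch_page(session, 1)
    total_pages = min(first.get("total_pages", MAX_PAGES), MAX_PAGES)
    yield from first["records"]
    for page in range(2, total_pages + 1):
        yield from fetch_page(session, page)["records"]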
Azure SQL JSON Indexed Computed Columns
You mentioned storing your data as JSON files in a storage container. Are you sure you need that? If so, you could create an external table that links the storage container to the database. That has the advantage of the data not taking up any space in the database. However, if the JSON will fit in the database, I would highly recommend dropping that JSON right into the SQL database and leveraging indexed computed columns to make querying the JSON extremely quick.
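As a rough sketch of that approach (table, column, and JSON property names are assumptions for illustration), the raw JSON goes into an NVARCHAR(MAX) column and a computed column over JSON_VALUE gets an index. The statements are shown here run from Python with pyodbc, but any SQL client works:

# Python sketch: store JSON in Azure SQL and index a computed column over it (illustrative names)
import os
import pyodbc

conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])
cur = conn.cursor()

# Raw JSON lands in an NVARCHAR(MAX) column.
cur.execute("""
    CREATE TABLE dbo.ApiData (
        Id INT IDENTITY PRIMARY KEY,
        Payload NVARCHAR(MAX) NOT NULL,
        LoadedAt DATETIME2 DEFAULT SYSUTCDATETIME()
    );
""")

# A computed column extracts one property from the JSON, and an index makes querying it fast.
cur.execute("ALTER TABLE dbo.ApiData ADD CustomerId AS JSON_VALUE(Payload, '$.customerId');")
cur.execute("CREATE INDEX IX_ApiData_CustomerId ON dbo.ApiData (CustomerId);")
conn.commit()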
Using this pairing should provide incredible performance-per-penny value! Let us know what you end up using.

Maybe you can create a scheduled task with SQL Server Agent.
SQL Server Agent > New Job > Steps > New Step:
In the Command box, put your SQL statements, for example the ones from Import JSON documents from Azure Blob Storage.
Schedules > New Schedule:
Set the execution time.
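The T-SQL inside the string below is roughly what might go in that Command box, following the documented OPENROWSET/OPENJSON pattern; the external data source, blob path, target table, and JSON property names are illustrative assumptions. It is shown run from Python via pyodbc so the same statement could also be reused from a function:

# Python sketch: run the blob-import T-SQL (illustrative names; the external data source must already exist)
import os
import pyodbc

IMPORT_SQL = """
INSERT INTO dbo.ApiData (Id, Name)
SELECT Id, Name
FROM OPENROWSET(
        BULK 'incoming/data.json',
        DATA_SOURCE = 'MyAzureBlobStorage',   -- pre-created external data source for the container
        SINGLE_CLOB) AS blob
CROSS APPLY OPENJSON(BulkColumn)
WITH (Id INT '$.id', Name NVARCHAR(100) '$.name');
"""

conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])
conn.execute(IMPORT_SQL)   # pyodbc connections expose execute() directly
conn.commit()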
But I think Azure Functions is better for you to do this. Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.

If you could set the default top N value in your API, then you could use a Web activity in Azure Data Factory to call your REST API and get the response data. Then configure the response data as the input of a Copy activity (@activity('ActivityName').output) and the SQL database as the output. Please see this thread: Use output from Web Activity call as variable.
The Web activity supports authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then
parsing data from the JSON into SQL Azure. Any recommendations?
Well, if you could dump the data into Blob storage, then Azure Stream Analytics is the perfect choice for you.
You could run a daily job to select or parse the JSON data with ASA SQL, then dump the data into the SQL database. Please see this official sample.

One thing to consider for scale would be to parallelize both the query and the processing. This helps if there is no ordering requirement, or if processing all records would take longer than the 10-minute function timeout. It also helps if you want to do some tweaking/transformation of the data in flight, if you have different destinations for different types of data, or if you want to be insulated from a failure - e.g., your function fails halfway through processing and you don't want to re-query the API. Or maybe you get data a different way and want to start processing at a specific step in the process (rather than running from the entry point). All sorts of reasons.
I'll caveat here to say that the best trade-off of parallelism vs. complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps, and you could combine them into a single one. Durable Functions can also make this orchestration easier.
- A timer-driven function queries the API to understand the depth of pages required, or queues up the additional pages to a second function that actually makes the paged API calls.
- That function then queries the API and writes to a scratch area (like Blob) or drops each row into a queue to be written/processed (e.g., a Storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested, e.g., pub/sub).
- If writing to a scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a Storage queue, since it would be cheap and fast for something like this).
- Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process with a correctly formatted message. If your processors encounter bad data, things like poison queues/dead-letter queues help with exception cases, so instead of your entire process dying, you can manually remediate the bad data. A rough sketch of the queue fan-out piece follows below.
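A rough sketch of that fan-out using the Azure Storage queue SDK for Python; the queue name, connection string variable, and API details are illustrative assumptions, and in a real Function App the two steps would live in a timer-triggered and a queue-triggered function respectively:

# Python sketch: enqueue one message per page, then process each message independently
# (queue name, connection string variable, and API details are illustrative assumptions)
import json
import os
import requests
from azure.storage.queue import QueueClient

conn_str = os.environ["AzureWebJobsStorage"]
pages_queue = QueueClient.from_connection_string(conn_str, "api-pages")

def enqueue_pages(total_pages):
    """Timer-triggered step: fan out one queue message per page to fetch."""
    for page in range(1, total_pages + 1):
        pages_queue.send_message(json.dumps({"page": page}))

def process_page_message(message_text, token):
    """Queue-triggered step: fetch one page and hand each record to the next stage."""
    page = json.loads(message_text)["page"]
    resp = requests.get("https://api.example.com/v1/records",
                        params={"page": page},
                        headers={"Authorization": f"Bearer {token}"},
                        timeout=30)
    resp.raise_for_status()
    for record in resp.json().get("records", []):
        handle_record(record)   # e.g., write to a scratch blob, another queue, or straight to SQL

def handle_record(record):
    # Placeholder for the final write (SQL insert, another queue, etc.).
    print(record)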

Related

Azure durable functions and synchronous read?

I'm trying to use Azure Durable Functions to implement a minimal server that accepts data via a POST and returns the same data via a GET - but I don't seem to be able to generate the GET response.
Is it simply not possible to return the result via a simple GET response? What I do NOT want to happen is:
GET issued
GET response returns with second URL
Client has to use second URL to get result.
I just want:
GET issued
GET response with requested data.
Possible?
I'm not really sure whether Durable Functions are meant to be used as a 'server'. Durable Functions are part of 'serverless', and therefore I'm not sure whether it's possible (in a clean way).
To my knowledge, Durable Functions are used to orchestrate long-lasting processes, for example handling the orchestration of a batch job. Using Durable Functions it's possible to create an 'Async HTTP API' to check on the status of processing a batch of items (Durable Functions documentation). I've written a blog post about Durable Functions, feel free to read it (https://www.luminis.eu/blog/azure-functions-for-your-long-lasting-logic/).
But as for your use case: I think you can create two separate Azure Functions. For posting your data, one can use an Azure Blob Storage output binding. Your second function can have a GET HTTP trigger and, depending on your data, use a blob input binding. No need to use Durable Functions for it :)! A sketch of the idea is below.
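A minimal sketch of that split in Python; the container name, blob naming scheme, and the use of the azure-storage-blob SDK directly instead of declarative bindings are all assumptions for illustration:

# Python sketch: one handler stores the posted payload in Blob Storage, the other reads it back
# (container/blob names and using the SDK instead of bindings are illustrative assumptions)
import os
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
container = blob_service.get_container_client("posted-data")

def handle_post(item_id: str, body: bytes):
    """POST function: persist the payload under a known name."""
    container.upload_blob(name=f"{item_id}.json", data=body, overwrite=True)

def handle_get(item_id: str) -> bytes:
    """GET function: return the previously posted payload."""
    return container.download_blob(f"{item_id}.json").readall()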
Not really a direct answer to your question, but hopefully a solution to your problem!

Firebase Cloud Express queue for storage resource to be generated

I have a large dataset stored in a Firestore collection and a Node.js Express app (exposed as a Firebase functions.https.onRequest) with an endpoint which allows users to query this dataset and download large amounts of data.
I need to return the data in CSV format from the endpoint. Because there is a lot of data, I want to avoid doing large database reads each time the endpoint is hit.
My current endpoint does this:
- User hits the endpoint with a database query requesting documents within a range
- Query is hashed into a filename, e.g. query_"startRange"_"endRange".csv
- Check Firebase Storage to see if this query has been run before
- If the CSV already exists:
  - Return a 302 redirect to the CSV file with a signed URL
- If the CSV doesn't exist:
  - Run the query on the Firestore collection
  - Transform the data into the appropriate CSV format
  - Upload the new CSV to Firebase Storage
  - Return a 302 redirect to the newly generated CSV file with a signed URL
This process is currently working really well, except I can already foresee an issue. The CSV generation stage takes roughly 20s for large queries, and there is a high possibility of the same request being made by multiple users at the same time.
I want to build in some sort of queuing system so that if X number of users hit the endpoint at once, only the first request triggers the generation of the new CSV and the other (X-1) requests will be queued and then resolved once the CSV is generated.
I have currently looked into firebase-queue, which appears to be deprecated and not intended to be used with Cloud Functions.
I have also seen other libraries like p-queue, but I'm not sure I understand how that would work with Firebase Cloud Functions and how separate instances are booted for many requests.
I think that in your scenario the queue approach wouldn't work quite well with Cloud Functions. The queue cannot be implemented in a function, as multiple instances won't know about each other; therefore the queue would need to be implemented in some kind of dedicated server, which IMO defeats the purpose of using Cloud Functions, as both the queue and the processing could be run on the same server.
I would suggest having a collection in Firestore that keeps track of the queries that have been requested. This way, even if the CSV file isn't yet saved in Storage, you can check whether some function instance is already creating it, then sleep the function until the operation completes and return the signed URL. Overall the algorithm might look somewhat like this:
# Python PseudoCode
if csv_in_storage():
    return signed_url()

if query_in_firestore():
    # Another instance is already generating this CSV; wait for it to appear.
    while True:
        sleep(X)
        if csv_in_storage():
            return signed_url()

try:
    add_query_firestore()   # record that this query is now being processed
    csv = create_csv()
    upload_csv(csv)
    return signed_url()
except Exception:
    # add_query_firestore() failed: another instance claimed the query first, so wait for its CSV.
    while True:
        sleep(X)
        if csv_in_storage():
            return signed_url()
The final try/except is there because the add_query_firestore operation might fail if two functions make simultaneous attempts to write the same document into Firestore. Nonetheless this is also good news, since you know the CSV creation is in progress and you can wait for it to complete.
Please keep in mind the pseudocode above is just to illustrate the idea; having the while True as it is may lead to an infinite loop and a function timeout, which is plain bad :).
I ended up solving this with a solution similar to what Happy-Monad suggested. I'm using the Node.js admin SDK, but the idea is similar.
There is a collection in Firestore, Queries, which keeps track of executed queries. When a user hits the endpoint, I call the admin doc("Queries/<queryId>").create() method. This method will only create the query doc if it doesn't already exist, so I avoid the race conditions between parallel requests that I would have if I checked for existing queries first.
Next, the request starts an onSnapshot listener on the query doc it attempted to create. The query has a status field which starts as created. The onSnapshot will only resolve once that status has changed to complete.
I have an onCreate database trigger listening to "Queries/*". This database trigger handles the requested query and updates the query status to complete. In the case that the query already exists, the status is already complete, so the onSnapshot resolves instantly.
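The solution above uses the Node.js admin SDK; purely as an illustration, here is the same claim-then-wait idea sketched with the Python Firestore client, where the collection and field names are assumptions and a bounded polling loop stands in for the onSnapshot listener:

# Python sketch of the claim-then-wait idea (the accepted answer uses the Node.js admin SDK;
# collection/field names and polling in place of onSnapshot are illustrative assumptions)
import time
from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

db = firestore.Client()

def claim_or_wait(query_id, timeout_s=120, poll_s=2):
    doc_ref = db.document(f"Queries/{query_id}")
    try:
        # create() fails if the document already exists, so only one request wins the claim.
        doc_ref.create({"status": "created"})
        return "claimed"   # this request should go on to generate the CSV
    except AlreadyExists:
        # Someone else is generating the CSV; wait for its status to flip to complete.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            snapshot = doc_ref.get().to_dict() or {}
            if snapshot.get("status") == "complete":
                return "ready"
            time.sleep(poll_s)
        raise TimeoutError("CSV generation did not complete in time")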

Spark results accessible through API

We would really like some input here about how the results from a Spark query can be made accessible to a web application. Given that Spark is widely used in the industry, I would have thought that this part would have lots of answers/tutorials, but I didn't find anything.
Here are a few options that come to mind:
- Spark results are saved in another DB (perhaps a traditional one) and a query request returns the new table name for access through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.
- Spark results are pumped into a messaging queue from which a socket-server-like connection is made.
What confuses me is that other connectors to Spark, like those for Tableau, using something like JDBC, should have all the data (not just the top 500 that we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?
Can someone with expertise help here?
The standard way I think would be to use Livy, as you mention. Since it's a REST API you wouldn't expect to get a JSON response containing the full result (could be gigabytes of data, after all).
Rather, you'd use pagination with ?from=500 and issue multiple requests to get the number of rows you need. A web application would only need to display or visualize a small part of the data at a time anyway.
But from what you mentioned in your comment to Raphael Roth, you didn't mean to call this API directly from the web app (with good reason). So you'll have an API layer that is called by the web app and which then invokes Spark. But in this case, you can still use Livy+pagination to achieve what you want, unless you specifically need to have the full result available. If you do need the full results generated on the backend, you could design the Spark queries so they materialize the result (ideally to cloud storage) and then all you need is to have your API layer access the storage where Spark writes the results.
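A sketch of what the paging loop against that API layer might look like from the web app's side, assuming a hypothetical endpoint that accepts from/size parameters (the URL and parameter names are assumptions, not Livy's actual interface):

# Python sketch: page through a backend API that fronts Spark results
# (endpoint URL and the from/size parameter names are illustrative assumptions)
import requests

API_URL = "https://myapi.example.com/spark-results/query-123"
PAGE_SIZE = 500

def fetch_rows(max_rows=5000):
    """Collect up to max_rows rows by issuing repeated paged requests."""
    rows, offset = [], 0
    while offset < max_rows:
        resp = requests.get(API_URL, params={"from": offset, "size": PAGE_SIZE}, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("rows", [])
        if not batch:
            break   # no more data
        rows.extend(batch)
        offset += len(batch)
    return rows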

Multiple endpoints of Microsoft Cognitive Services simultaneously

So I'm writing an app in Node.js which analyses sentiment and extracts key phrases using Microsoft Cognitive Services. To do these two separate things, I use two nested POST calls, one starting when the other finishes. However, this seems inefficient. The documentation mentions that, to examine sentiment for example, the URL 'must contain /sentiment', not that it must be the only endpoint for that one call. Perhaps I could do both in one call and save myself Azure budget and execution time?
I have tried /sentiment/keyPhrases, which returns a 404. /sentiment/../keyPhrases just does keyPhrases. I cannot think of another way to combine the two endpoints.
Turns out I was misled by my understanding of the documentation. The endpoints must be called separately, and the answers can be combined in Node afterwards using response = response1.concat(response2);.
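For the same combine-afterwards approach, here is a small Python sketch issuing the two calls in parallel rather than nested; the region, endpoint version, and payload shape follow the v2.x Text Analytics REST API as I recall it, so treat them as assumptions to verify against your own resource:

# Python sketch: call /sentiment and /keyPhrases in parallel and combine the answers
# (region, endpoint version, and payload shape are assumptions to check against your resource)
import os
from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.1"
HEADERS = {"Ocp-Apim-Subscription-Key": os.environ["COGNITIVE_KEY"]}
documents = {"documents": [{"id": "1", "language": "en", "text": "I love this product!"}]}

def call(path):
    resp = requests.post(f"{BASE}/{path}", headers=HEADERS, json=documents, timeout=10)
    resp.raise_for_status()
    return resp.json()

with ThreadPoolExecutor(max_workers=2) as pool:
    sentiment_future = pool.submit(call, "sentiment")
    phrases_future = pool.submit(call, "keyPhrases")

# Combine the two responses afterwards, the same idea as the concat in Node.
combined = {
    "sentiment": sentiment_future.result(),
    "keyPhrases": phrases_future.result(),
}
print(combined)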

Azure function misbehaving with multiple http hits at same time

I have created an Azure Function in PowerShell which runs on an HTTP hit. It writes a JSON file to its root folder after processing. But if multiple hits occur at the same time, it throws a file-in-use error. I know Azure Functions doesn't handle multithreading well, and variables can be modified while one process is running and a second process starts. I don't want to use queue storage, so any good suggestion on how to do this?
DO NOT WRITE ANYTHING TO THE AZURE FUNCTIONS FILESYSTEM THAT YOU DO NOT WANT TO LOSE.
Use Cosmos DB or some other external data store to store your data.
Without a code sample, it is hard to say what you might be doing wrong. You should be able to do this fine from a code point of view, but you need to check whether the file is in use and handle errors when it is (i.e. wait, or return a 429 error, etc.).
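If the output really needs to be kept, one way to follow the advice above is to write each invocation's result to external storage under a unique name instead of a shared file on the function's filesystem. A sketch in Python with the Azure Blob SDK (the original function is PowerShell, and the container name and naming scheme here are assumptions):

# Python sketch: write each invocation's JSON to Blob Storage under a unique name,
# so concurrent hits never contend for the same local file
# (the original function is PowerShell; container name and naming scheme are assumptions)
import json
import os
import uuid
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
container = blob_service.get_container_client("function-output")

def save_result(payload: dict):
    blob_name = f"result-{uuid.uuid4()}.json"   # unique per invocation, no lock contention
    container.upload_blob(name=blob_name, data=json.dumps(payload))
    return blob_name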
