I am creating a Node.js app that consumes APIs from multiple servers sequentially, because each request depends on the results of previous requests.
For instance, user registration is handled on our platform in a PostgreSQL database. User feeds, chats, and posts are stored on getStream servers. User roles and permissions are managed through a CMS. If a page needs to display a list of a user's followers along with buttons that depend on each follower's permissions, I first have to fetch the current user's followers from getStream, then enrich them with data from my PostgreSQL DB, and then fetch their permissions from the CMS. Since each request has to wait for the previous one, it takes a long time to return a response.
I need to serve all of that data in a certain format. I have already used Promise.all() where requests do not depend on each other.
I have thought of storing pre-processed data that is ready to be served, but I am not sure how to do that. What is the best way to solve this problem?
"sequential manner as the next request depends on results from previous requests"
You could try using async/await so that each request runs only after the previous one has completed.
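For the follower example, a rough sketch could look like this (getFollowersFromStream, enrichFromPostgres and getPermissionsFromCms are hypothetical placeholders for your actual getStream, PostgreSQL and CMS calls). The first call has to be awaited on its own, but the per-follower enrichment can still be fanned out with Promise.all():

    // Minimal sketch, assuming hypothetical helpers for each backend.
    async function buildFollowerList(userId) {
      // Step 1: must run first; the later calls depend on its result.
      const followers = await getFollowersFromStream(userId);

      // Steps 2 and 3: enriching one follower does not depend on the
      // others, so those calls can run in parallel.
      return Promise.all(
        followers.map(async (follower) => {
          const [profile, permissions] = await Promise.all([
            enrichFromPostgres(follower.id),
            getPermissionsFromCms(follower.id),
          ]);
          return { ...follower, ...profile, permissions };
        })
      );
    }

This keeps the unavoidable sequential step, but the enrichment work no longer waits one follower at a time.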
I have a NightwatchJS test that requires an MSSQL DB call to complete before moving on to verify the results of that DB call. The test has two parts.
Part 1 fills out a form, submits it on our website, verifies via an API in our CMS that the form posted successfully, and saves the GUID in a table in another DB used specifically for Nightwatch testing, so it can be verified later.
Part 2 runs later in the day, to allow another internal process to 'ingest' that form into a different department's DB and return the ingestion results for that form to our CMS. Part 2 then needs to do a lookup in the Nightwatch DB, get all GUIDs from the last 24 hours, and hit another API endpoint in our CMS to verify that the form was ingested into the other department's system, by checking a field that the other department's process updates on ingestion.
I had the same waiting-to-complete issue with the API calls, where I needed NightwatchJS to wait for the API call to complete in order to use the results in the assertion. To solve that, I used a synchronous HTTP library called 'sync-request'. So, that part works fine.
However, I cannot seem to find a synchronous Db library that works in Nightwatch's world.
I am currently using 'tedious' as my DB library, but there is no mechanism for awaiting. I tried promises and async/await to no avail, since the async work is wrapped inside the library.
I tried co-mssql, but using their exact example code I keep getting an error:

    TypeError: co(...) is not a function
Any ideas or suggestions?
Any other synchronous MSSQL libraries that work in NightwatchJS?
Any way of using 'tedious' in a fashion that ensures await-ability?
I found a library called sync-request that blocks the thread, which is what I need.
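One way to get await-ability out of 'tedious' is to wrap each query in a Promise yourself. Here is a rough sketch (the config object and SQL text are placeholders, and note the comment about the explicit connect() call):

    // Rough sketch: promisify a tedious query so it can be awaited.
    // 'config' is a normal tedious connection config (placeholder here).
    const { Connection, Request } = require('tedious');

    function queryDb(config, sqlText) {
      return new Promise((resolve, reject) => {
        const connection = new Connection(config);

        connection.on('connect', (err) => {
          if (err) return reject(err);

          const rows = [];
          const request = new Request(sqlText, (err) => {
            connection.close();
            return err ? reject(err) : resolve(rows);
          });

          request.on('row', (columns) => {
            const row = {};
            columns.forEach((col) => { row[col.metadata.colName] = col.value; });
            rows.push(row);
          });

          connection.execSql(request);
        });

        // Newer tedious versions require this explicit connect() call;
        // older versions connected automatically on construction.
        connection.connect();
      });
    }

With a wrapper like that, a test step can await queryDb(config, '...') (recent Nightwatch versions accept async test functions) and then assert against the returned rows.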
My needs are the following:
- Need to fetch data from a 3rd party API into SQL Azure.
The APIs will be queried every day for incremental data and may require pagination, since by default any API response returns only the top N records.
The API also needs an auth token, which is obtained in a first call before we start downloading data from the endpoints.
For the last two reasons, I've opted for a Function App triggered daily, rather than Data Factory, which can also query web APIs.
Is there a better way to do this?
Also, I am thinking of pushing all the JSON into Blob storage and then parsing the data from the JSON into SQL Azure. Any recommendations?
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the JSON data directly into a SQL database.
Azure Function
Azure Functions are very cost effective; the first million executions are free. If it takes longer than ten minutes, then have a look at Durable Functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar: get the total number of pages from the API, set a variable to that value, and loop over the pages, getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Pro tip: make sure to specify an upper bound for those loops. Also, if your API is flaky or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
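As a rough illustration of that bounded pagination loop in Node.js (assuming Node 18+ for the built-in fetch; baseUrl, token, and saveRows are placeholders):

    // Minimal sketch of a bounded pagination loop (Node 18+, global fetch).
    // baseUrl, token and saveRows are placeholders for your own values.
    const MAX_PAGES = 1000; // hard upper bound so the loop can never run away

    async function fetchAllPages(baseUrl, token, saveRows) {
      for (let page = 1; page <= MAX_PAGES; page++) {
        const res = await fetch(`${baseUrl}?page=${page}`, {
          headers: { Authorization: `Bearer ${token}` },
        });
        if (!res.ok) break; // the API signals no more pages (or a real failure)

        const rows = await res.json();
        if (!Array.isArray(rows) || rows.length === 0) break; // empty page: done

        await saveRows(rows); // e.g. insert the batch into SQL in each iteration
      }
    }

A retry helper with exponential backoff would wrap the fetch call inside that loop.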
Azure SQL Json Indexed Calculated Columns
You mentioned storing your data as JSON files in a storage container. Are you sure you need that? If so, then you could create an external table link between the storage container and the database. That has the advantage of the data not taking up any space in the database. However, if the JSON will fit in the database, I would highly recommend dropping that JSON right into the SQL database and leveraging indexed calculated columns to make querying the JSON extremely quick.
Using this pairing should provide incredible performance per penny! Let us know what you end up using.
Maybe you can create a scheduled task with SQL Server Agent.
SQL Server Agent -- New Job -- Steps -- New Step:
In the Command box, put in your SQL statements, for example the ones from Import JSON documents from Azure Blob Storage.
Schedules -- New Schedule:
Set the execution time.
But I think an Azure Function is better for this. Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you could set the default top N values in your API, then you could use a Web activity in Azure Data Factory to call your REST API and get the response data. Then configure the response data as the input of a Copy activity (@activity('ActivityName').output) and the SQL database as the output. Please see this thread: Use output from Web Activity call as variable.
The Web activity supports authentication properties for your access token.
"Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?"
Well, if you could dump the data into Blob storage, then Azure Stream Analytics is the perfect choice for you.
You could run a daily job to select or parse the JSON data with ASA SQL, then dump the data into the SQL database. Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing. That helps if there is no ordering requirement, or if processing all records would take longer than the 10-minute function timeout. Or if you want to do some tweaking/transformation of the data in flight, or if you have different destinations for different types of data. Or if you want to be insulated from a failure, e.g., your function fails halfway through processing and you don't want to re-query the API. Or if you get the data a different way and want to start processing at a specific step in the process, rather than running from the entry point. All sorts of reasons.
I'll caveat here to say that the best trade-off between degree of parallelism and complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split out specific steps, and you may combine them into a single one. Durable Functions can also make this kind of orchestration easier.
- A timer-driven function queries the API to understand the depth of pages required, and queues up the additional pages for a second function that actually makes the paged API calls.
- That function then queries the API and writes to a scratch area (like Blob), or drops each row into a queue to be written/processed (e.g., a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested, as in pub/sub).
- If writing to a scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, which would be cheap and fast for something like this).
- Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process with a correctly formatted message. If your processors encounter bad data, things like poison queues/dead-letter queues help with the exception cases, so instead of your entire process dying, you can manually remediate the bad data.
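As a rough sketch of that last queue-triggered step (assuming the Node.js v3 programming model, a function.json with a queueTrigger binding named rowMessage, and a hypothetical writeRowToSql helper, e.g. built on the 'mssql' package):

    // Queue-triggered Azure Function: writes one row per queue message.
    // writeRowToSql is a hypothetical helper that performs the actual INSERT.
    const writeRowToSql = require('../shared/writeRowToSql');

    module.exports = async function (context, rowMessage) {
      try {
        // Each message carries exactly one row, so a failure here only
        // affects that row, not the whole batch.
        await writeRowToSql(rowMessage);
      } catch (err) {
        // Throwing lets the runtime retry the message; after the max
        // dequeue count it lands on the poison queue for manual review.
        context.log.error(`Failed to write row: ${err.message}`);
        throw err;
      }
    };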
I have an API endpoint for an event store against which I can issue a GET request and receive a feed of events in NDJSON format. I need to automate the collection of these events and store them in a database. As these events are in a nested JSON structure, and some of them have a complex structure, I was thinking of storing them in a document database. Can you please help me with the options I have for capturing and storing these events, with respect to the Python libraries/frameworks I can use to achieve this? To understand the events, I was able to use the Requests library and get them. I also tried asyncio and aiohttp to fetch the events asynchronously, but that ran slower than the Requests run. Can we create any pipeline to pull these events from the endpoint at frequent intervals?
Also, some of these nested JSON keys have dots, and MongoDB is not allowing me to store them. I tried Cosmos DB as well and it worked fine (the only issue there was that if the JSON has a key "ID", it has to be unique; as these JSON feeds have an "ID" key which is not unique, I had to rename the dict key before storing into Cosmos DB).
Thanks,
Srikanth
I have created an Azure Function in PowerShell which runs on an HTTP hit. After processing, it writes a JSON file to its root folder. But if multiple hits occur at the same time, it throws a 'file in use' error. I know Azure Functions don't handle multi-threading well, and variables can be modified while one invocation is running and a second one starts. I don't want to use queue storage, so are there any good suggestions for how to do this?
DO NOT WRITE ANYTHING TO THE AZURE FUNCTIONS FILESYSTEM THAT YOU DO NOT WANT TO LOSE.
Use Cosmos DB or some other external data store to store your data.
Without a code sample, it is hard to say what you might be doing wrong. You should be able to do this fine from a code point of view, but you need to check whether the file is in use and handle errors when it is (i.e., wait, or throw a 429 error, etc.).