I have an external API that invokes my HTTP-triggered Azure Function with the same query parameters 5 times at the same moment, so 5 requests are processed concurrently. Each request adds a record to my Google Sheet, which causes unwanted duplicate records. My function checks the sheet for a duplicate before pushing a new record, but when 5 instances are invoked concurrently, the duplicate does not exist yet. Is there any simple solution to process those 5 requests one by one, without concurrency?
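If you can restrict the app to a single instance (for example with the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT application setting mentioned further down), one simple option is to serialize the work inside the function with an in-process promise queue. A rough sketch, where appendRowIfMissing is a hypothetical stand-in for the existing duplicate-check-and-insert logic:

// Azure Functions (Node) HTTP trigger: each invocation handled by this instance
// is chained onto a shared promise, so requests run strictly one after another.
let queue = Promise.resolve();

module.exports = async function (context, req) {
    const run = queue.then(() => appendRowIfMissing(req.query)); // hypothetical helper
    queue = run.catch(() => {}); // keep the chain alive even if this run fails
    context.res = { body: await run };
};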
Related
Let's say, I have 1000 documents in a Firestore collection.
How do I execute the same Cloud Function 10 times in parallel, so that each execution processes 100 documents, say every 5 minutes?
I am aware I can use a Scheduler for the "every 5 minutes" part. The objective here is to distribute the load using multiple executions of the same function in parallel to handle the tasks. When the collection grows, I would like to add more instances. For example, let's say 1 execution per 100 documents.
I don't mind having another (or more) function to handle the distribution itself, and I don't mind the number of executions. I just don't want to loop through a large collection and process the tasks in a single function execution.
The numbers given above are examples. I am also open to using other services within GCP.
If you want to execute the Cloud Function every time some change occurs in the Firestore documents, then you can use a Cloud Firestore trigger in Cloud Functions. The Cloud Function basically waits for changes, triggers when an event occurs and performs its tasks. You can go through these documents on Firestore triggers: Google Cloud Firestore Trigger, Cloud Firestore Triggers.
In case you are concerned that Cloud Functions will not be able to process the requests in parallel, you should check out this document. Cloud Functions handles an incoming request by assigning it to an instance; if the volume of requests increases, Cloud Functions will start new instances to handle them.
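For illustration, a minimal sketch of such a trigger (the items collection name and the processDocument helper are assumptions):

const functions = require('firebase-functions');

// Fires for every create, update or delete in the (assumed) "items" collection.
exports.onItemWrite = functions.firestore
  .document('items/{itemId}')
  .onWrite((change, context) => {
    if (!change.after.exists) {
      return null; // the document was deleted
    }
    // Placeholder for whatever per-document processing you need.
    return processDocument(context.params.itemId, change.after.data());
  });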
Let's assume you have a function that, when called, processes a single document and does whatever you need with it. Let's call that function doSomething and let's assume it takes the document's path as a parameter.
Then, you can create a function that will be scheduled every 5 minutes. In this function, you'll retrieve all the documents, hold them in an array (let's call it documents) and do something like:
// Modular Firebase client SDK imports (assuming the callable client SDK is
// available in the environment where the scheduled function runs).
import { getFunctions, httpsCallable } from 'firebase/functions';

const functions = getFunctions();
const doSomething = httpsCallable(functions, 'doSomething');

// One pending callable invocation per document, fired all at once.
const calls = documents.map((document) => doSomething({ path: document.path }));
await Promise.all(calls);
This creates an array of pending calls and fires them all at once, giving you parallel executions of the same function.
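For completeness, a minimal sketch of what the doSomething callable itself might look like on the Functions side (the processDocument step is an assumption):

const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

// Callable function that processes a single document identified by its path.
exports.doSomething = functions.https.onCall(async (data, context) => {
  const snapshot = await admin.firestore().doc(data.path).get();
  if (!snapshot.exists) {
    return { processed: false };
  }
  await processDocument(snapshot.data()); // hypothetical per-document work
  return { processed: true };
});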
I have a Python Flask page which is extremely slow to generate. It takes about 1 minute to pull all the data from external APIs and process it before returning the page. Fortunately, the data is valid for up to 1 hour, so I can cache the result and return cached results quickly for most of the requests.
This works well except for the minute after the cache expires. If 10 requests are made within that single minute, there will be 10 calls to the veryslowpage() function; this eats up the HTTPS connection pool due to the external API calls and eats up memory due to the data processing, affecting other pages on the site. Is there a method to limit this function to a single instance, so that 10 requests result in only 1 call to veryslowpage() while the rest wait until the cached result is ready?
from flask import Flask, request, abort, render_template
from flask_caching import Cache

app = Flask(__name__)
# Cache setup assumed; adjust CACHE_TYPE / backend to your deployment.
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

@app.route('/veryslowpage', methods=['GET'])
@cache.cached(timeout=3600, query_string=True)
def veryslowpage():
    data = callexternalAPIs()
    result = heavydataprocessing(data)
    return render_template("./index.html", content=result)
You could simply create a function that periodically fetches the data from the API (every hour) and stores it in your database. Then, in your route function, read the data from your database instead of the external API.
A better approach is to create a very simple script that fetches the data every hour and updates the database, and to call it (in another thread) from your app/__init__.py.
You could create a file or a database entry recording that you are calculating the response in a different thread. Then, your method would check whether such a file exists and, if it does, wait and check for the response periodically.
You could also proactively create the data once every hour (or every 59 minutes if that matters) so you always have a fresh response available. You could use something like APScheduler for this.
If I schedule a timer triggered Azure function to run every second and my function is taking 2 seconds to execute, will I just get back-to-back executions or will some execution queue eventually overflow?
Background:
We have a timer-triggered Azure Function that is currently executing every 30 seconds and is checking for new rows in a database table. If there are new rows, the data will be processed and the rows will be marked as handled.
If there are no new rows the execution is very fast. If there are 500 new rows (which is the max we are fetching at the moment) the execution takes about 20-25 seconds.
We would like to decrease the interval to one second to reduce the latency of row processing.
Update: I want back-to-back executions and I want to avoid overlapping executions.
Multiple Azure Function executions can run concurrently. This means you can trigger the function again while the previously triggered execution is still running, and both will run concurrently. They will only queue up if you set options to run only one execution at a time on one instance, but it doesn't look like you want that.
With concurrency, this means that two executions may read the same table in the DB at the same time. So you should read your table with the UPDLOCK option LINK. This will prevent a subsequently triggered execution from reading the same rows that were read by the previous one.
In short, the answer to your question is neither. If your functions overlap, by default, you will get multiple executions running at the same time. LINK
To achieve back-to-back execution for timer triggers, set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT and FUNCTIONS_WORKER_PROCESS_COUNT to 1 in the application settings configuration. This will ensure only one function execution runs at a time. See this LINK.
I have an application that is gathering data from another application, both in NodeJS.
I was wondering, how can I trigger sending the data to a third application under certain conditions? For example, every 10 minutes if there's data in a bucket, or when I have 20 elements to send?
And if the call to the third application fails, how can I repeat it after 10-15 minutes?
EDIT:
The behaviour should be something like:
if at least 1 item has been posted (axios.post) AND [10 minutes have passed OR 10 more items have been posted], SUBMIT to App n.3
What can help me do this? Can I keep the values saved until those requirements are satisfied?
Thank you <3
You can use a package like node-schedule, which is popular for scheduling tasks. When the callback runs, check whether there is enough data (posts) to send.
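A minimal sketch of that idea (the in-memory buffer, the 20-item threshold, and the endpoint URL are assumptions for illustration):

const schedule = require('node-schedule');
const axios = require('axios');

const buffer = []; // items waiting to be forwarded to the third application

async function flush() {
  if (buffer.length === 0) return; // nothing to send yet
  const batch = buffer.splice(0, buffer.length);
  try {
    await axios.post('https://third-app.example.com/ingest', batch);
  } catch (err) {
    // On failure, put the items back so a later run retries them.
    buffer.unshift(...batch);
  }
}

// Called whenever new data arrives from the second application.
function onData(item) {
  buffer.push(item);
  if (buffer.length >= 20) flush(); // size-based trigger
}

// Time-based trigger: every 10 minutes.
schedule.scheduleJob('*/10 * * * *', flush);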
What would be a good approach to running a repetitive task for each row in a large Postgres DB table, on a different per-row interval, in Node.js?
To give you some more context, here's a quick description of the application:
It's a chat based customer support app.
It consists of teams, which can be either a client team or a support team. Teams have users, which can be either client users or support users.
Client users send messages to a support team and wait for one of that team's users to answer their question.
When there's an unanswered client message waiting for a response, every agent for the receiving support team will receive a notification every n seconds (n being set on a per-team basis by the team admin).
So this task needs to infinitely loop through the rows in the teams table and send notifications if:
The team has messages waiting to be answered.
N seconds have passed since the last notification was sent (N being the number of seconds set by the team admin).
There might be a better approach to this condition altogether.
So my questions are:
What is an efficient way to infinitely loop through a Postgres table with no upper limit on the number of rows?
Should I load 1 row at a time? Several at a time?
What would be a good way to do this in Node?
I'm using Knex. Does Knex provide a mechanism for lazy loading a table and iterating through the rows?
A) Running a repetitive task via Node can be done with the built-in JS function setInterval.
// run intervalFnc() every 5 seconds
const timerId = setInterval(intervalFnc, 5000);
function intervalFnc() { console.log("Hello"); }
// to stop running it:
clearInterval(timerId);
Then your interval function can do the actual work. An alternative would be to use cron (Linux) or some other OS process scheduler to trigger the function. I would use setInterval if you want to run the task every minute, and a cron job if you want to run it every hour (anything in between is more debatable).
B) An efficient way...
B-1) Retrieving a block of records from a DB will be more efficient than one at a time. Knex has .offset and .limit clauses to choose a group of records to retrieve. A sample from the knex doc:
knex.select('*').from('users').limit(10).offset(30)
B-2) Database indexed access is important for performance if your tables are very large. I would recommend including a status-flag field in your table to note which records are 'in-process', and also a "next-review-timestamp" field, with both fields indexed. Retrieve the records that have status_flag = 'in-process' AND next_review_timestamp <= now(). Sample:
knex('users').where('status_flag', 'in-process').whereRaw('next_review_timestamp <= now()')
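Putting A and B-2 together, a rough sketch of the polling loop (the teams table layout, the notification_interval_seconds column and the notifyTeam helper are assumptions):

const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL });

async function reviewDueTeams() {
  // Grab a bounded batch of teams that are due for a notification.
  const dueTeams = await knex('teams')
    .where('status_flag', 'in-process')
    .whereRaw('next_review_timestamp <= now()')
    .limit(100);

  for (const team of dueTeams) {
    await notifyTeam(team); // hypothetical: send the notification to the team's agents
    // Schedule the next review per the team's own interval.
    await knex('teams')
      .where('id', team.id)
      .update({
        next_review_timestamp: knex.raw("now() + (?? * interval '1 second')", ['notification_interval_seconds']),
      });
  }
}

// Poll every 5 seconds, as in the setInterval example above.
setInterval(() => {
  reviewDueTeams().catch(console.error);
}, 5000);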
Hope this helps!