Let's say, I have 1000 documents in a Firestore collection.
How do I execute the same 1 Cloud Function but 10 times in parallel to process 100 documents each, say every 5 minutes?
I am aware I can use a Scheduler for the "every 5 minutes" part. The objective here is to distribute the load using multiple executions of the same function in parallel to handle the tasks. When the collection grows, I would like to add more instances. For example, let's say 1 execution per 100 documents.
I don't mind having another (or more) function to handle the distribution itself, and I don't mind the number of executions. I just don't want to loop through a large collection and process the tasks in a single function execution.
The numbers given above are examples. I am also open to using other services within GCP.
If you want to execute the Cloud Function every time a Firestore document changes, you can use a Cloud Firestore trigger in Cloud Functions. The function waits for changes, triggers when an event occurs, and performs its task. See these documents on Firestore triggers: Google Cloud Firestore Trigger, Cloud Firestore Triggers.
If you are concerned that Cloud Functions will not be able to process the requests in parallel, check out this document. Cloud Functions handles incoming requests by assigning each one to an instance; when the volume of requests increases, it starts new instances to handle them.
Let's assume you have a function that, when called, processes a single document and does whatever you need with it. Let's call that function doSomething and assume it takes the document's path as a parameter.
Then, you can create a function that will be scheduled every 5 minutes. In this function, you'll retrieve all the documents, holding them in an array (let's call it documents) and do something like:
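// Callable reference to the deployed doSomething function
const doSomething = httpsCallable(functions, 'doSomething');
// Start one call per document; map collects the pending promises
const calls = documents.map((document) => doSomething({path: document.path}));
// Wait for every call to finish
await Promise.all(calls);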
This starts every call as the array is built and then waits for all of them to finish, so you get parallel executions of the same function, one per document.
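The question asks for one execution per 100 documents rather than one per document. A minimal sketch of that variation, assuming the worker accepts a list of paths: `chunk`, `dispatchInBatches` and `processBatch` are hypothetical names used only for illustration, and `processBatch` stands in for whatever actually invokes the worker function.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical dispatcher: starts one call per batch and waits for all of them.
// `processBatch` is an assumption; it should return a promise.
async function dispatchInBatches(paths, batchSize, processBatch) {
  const batches = chunk(paths, batchSize);
  return Promise.all(batches.map((batch) => processBatch(batch)));
}

// Example with made-up document paths: 1000 paths in batches of 100.
const examplePaths = Array.from({ length: 1000 }, (_, i) => `tasks/doc${i}`);
const exampleBatches = chunk(examplePaths, 100);
console.log(exampleBatches.length); // 10
```

When the collection grows, the number of batches (and therefore of parallel executions) grows with it, without changing the dispatcher.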
I have an external API which invokes my HTTP-triggered Azure Function with the same query parameters 5 times at the same moment. So 5 requests are processed concurrently; each request adds a record to my Google Sheet, which causes unwanted duplicate records. My function checks for a duplicate in that sheet before pushing a new record, but when 5 instances are called at the same time, the duplicate does not exist yet. Is there any simple solution to process those 5 requests one by one, without concurrency?
If I schedule a timer triggered Azure function to run every second and my function is taking 2 seconds to execute, will I just get back-to-back executions or will some execution queue eventually overflow?
Background:
We have a timer-triggered Azure Function that currently executes every 30 seconds and checks for new rows in a database table. If there are new rows, the data is processed and the rows are marked as handled.
If there are no new rows the execution is very fast. If there are 500 new rows (which is the max we are fetching at the moment) the execution takes about 20-25 seconds.
We would like to decrease the interval to one second to reduce the latency of row processing.
Update: I want back-to-back executions and I want to avoid overlapping executions.
Multiple Azure Functions can run concurrently. This means you can still trigger the function again while the previously triggered function is still running, and both will run concurrently. They will only queue up if you set options to run only 1 function at a time on 1 instance, but it doesn't look like you want that.
With concurrency, this means that 2 functions will read the same table on the DB at the same time. So you should read your table with UPDLOCK option LINK. This will prevent the subsequent triggered function from reading the same rows that were read in the previous function.
In short, the answer to your question is neither. If your functions overlap, by default, you will get multiple functions running at the same time. LINK
To achieve back-to-back execution for timer triggers, set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT and FUNCTIONS_WORKER_PROCESS_COUNT to 1 in the application settings. This ensures that only one function execution runs at a time. See this LINK.
I'm working with DocumentDB in AWS, and I've been having troubles when I try to read from the same collection simultaneously from different aggregation queries.
The issue is not that I cannot read from the database, but rather that it takes a lot of time to complete the queries. It doesn't matter if I trigger the queries simultaneously or one after the other.
I'm using a Lambda Function with NodeJS to run my code. And I'm using mongoose to handle the connection with the database.
Here's a sample code that I put together to illustrate my problem:
function query1() {
return Collection.aggregate([...])
}
function query2() {
return Collection.aggregate([...])
}
function query3() {
return Collection.aggregate([...])
}
It takes the same time if I run it using Promise.all
Promise.all([ query1(), query2(), query3() ])
Than if I run it waiting for the previous one to finish
query1().then(result1 => query2().then(result2 => query3()))
While if I run each query in different Lambda Executions, it takes significantly less time for each individual query to finish (Between 1 and 2 seconds).
So if they were running in parallel the execution should be finished with the time of the query that takes the most time (2 seconds), and not take 7 seconds, as it does now.
So my guess is that the DocumentDB instance is running the queries in sequence no matter how I send them. The collection holds around 19,000 documents with a total size of almost 25Mb.
When I check the metrics of the instance, the CPUUtilization is barely over 8% and the RAM available only drops by 20Mb. So I don't think the problem of the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?
What would be a good approach to running a repetitive task for each row in a large Postgres table, on a different interval per row, in Node.js?
To give you some more context, here's a quick description of the application:
It's a chat based customer support app.
It consists of teams, which can be either a client team or a support team. Teams have users, which can be either client users or support users.
Client users send messages to a support team and wait for one of that team's users to answer their question.
When there's an unanswered client message waiting for a response, every agent for the receiving support team will receive a notification every n seconds (n being set on a per-team basis by the team admin).
So this task needs to infinitely loop through the rows in the teams table and send notifications if:
The team has messages waiting to be answered.
N seconds have passed since the last notification was sent (N being the number of seconds set by the team admin).
There might be a better approach to this condition altogether.
So my questions are:
What is an efficient way to infinitely loop through a Postgres table with no upper limit on the number of rows?
Should I load 1 row at a time? Several at a time?
What would be a good way to do this in Node?
I'm using Knex. Does Knex provide a mechanism for lazy loading a table and iterating through the rows?
A) A repetitive task can be run in Node with the built-in JavaScript function setInterval.
// run intervalFnc() every 5 seconds
const timerId = setInterval(intervalFnc, 5000);
function intervalFnc() { console.log("Hello"); }
// to stop running it:
clearInterval(timerId);
Then your interval function can do the actual work. An alternative would be to use cron (linux), or some OS process scheduler to trigger the function. I would use this method if you want to do it every minute, and a cron job if you want to do it every hour (in between these times becomes more debatable).
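One thing to watch for: if the interval function does asynchronous work that can take longer than the interval, the timer will start the next run while the previous one is still going. A minimal sketch of a guard that skips a tick in that case; `skipIfRunning` and `demoTask` are hypothetical names, and the guard only protects a single Node process, not several servers.

```javascript
// Wrap an async task so that a new run is skipped while one is in progress.
// Resolves to true if the task ran, false if the tick was skipped.
function skipIfRunning(task) {
  let running = false;
  return async function guarded() {
    if (running) return false; // previous run has not finished yet
    running = true;
    try {
      await task();
      return true;
    } finally {
      running = false; // release the guard even if the task throws
    }
  };
}

// Demo: a task that takes ~50 ms, scheduled every 20 ms, stopped after 200 ms.
let demoRuns = 0;
async function demoTask() {
  demoRuns += 1;
  await new Promise((resolve) => setTimeout(resolve, 50));
}
const demoTimer = setInterval(skipIfRunning(demoTask), 20);
setTimeout(() => {
  clearInterval(demoTimer);
  console.log(`task started ${demoRuns} times without overlapping`);
}, 200);
```

Skipping is one possible policy; rescheduling with setTimeout only after the task completes would give back-to-back runs instead.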
B) An efficient way...
B-1) Retrieving a block of records from a DB will be more efficient than one at a time. Knex has .offset and .limit clauses to choose a group of records to retrieve. A sample from the knex doc:
knex.select('*').from('users').limit(10).offset(30)
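A minimal sketch of walking a whole table in blocks with limit and offset. `forEachBatch` is a hypothetical helper, and `fetchPage(limit, offset)` is an assumed callback that resolves to an array of rows; with Knex it could be something like `(limit, offset) => knex('teams').orderBy('id').limit(limit).offset(offset)`, where the table and column names are assumptions. The demo uses an in-memory array instead of a database.

```javascript
// Call handleBatch once per block of rows until the source is exhausted.
// Returns the total number of rows seen.
async function forEachBatch(fetchPage, pageSize, handleBatch) {
  let offset = 0;
  let total = 0;
  while (true) {
    const rows = await fetchPage(pageSize, offset);
    if (rows.length === 0) break; // nothing left
    await handleBatch(rows);
    total += rows.length;
    offset += rows.length;
    if (rows.length < pageSize) break; // short page means last page
  }
  return total;
}

// In-memory stand-in for the table; a real query needs a stable ORDER BY,
// otherwise offset paging can skip or repeat rows.
const fakeTeams = Array.from({ length: 25 }, (_, i) => ({ id: i + 1 }));
const fakeFetchPage = async (limit, offset) =>
  fakeTeams.slice(offset, offset + limit);

forEachBatch(fakeFetchPage, 10, async (rows) => {
  console.log(`processing ${rows.length} rows starting at id ${rows[0].id}`);
}).then((total) => console.log(`done, ${total} rows`));
```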
B-2) Indexed access is important for performance if your tables are very large. I would recommend including a status flag field in your table to note which records are 'in-process', plus a "next-review-timestamp" field, with both fields indexed. Retrieve the records that have status_flag='in-process' AND next_review_timestamp <= now(). Sample:
knex('users').where('status_flag', 'in-process').whereRaw('next_review_timestamp <= now()')
Hope this helps!
Background
I have a Node and React based application. I'm using Firebase for my storage and database. In my application users can fill out a form where they upload an image and select a time for the image to be added to their website. I save each image update as an object in my Firebase database like so. Images are arranged in order of ascending update time.
user-name: {
images: [
{
src: 'image-src-url',
updateTime: 1503953587727
},
{
src: 'image-src-url',
updateTime: 1503958424838
}
]
}
Scale
My applications db could potentially get very large with a lot of users and images. I'd like to ensure scalability.
Issue
How do I check when a specific image object's time has been reached and then execute a function? (I do not need assistance with the actual function that is run, just with checking the db for a specific time.)
Attempts
I've thought about running a cron job using node-cron that checks the entire database every 60s (users can only specify the minute the image will update, not the seconds). If it finds a matching updateTime, it executes my function. My concern is that at a large scale the cron job will take a while to search the db and potentially miss a time.
I've also thought about when the user schedules a new update then dynamically create a specific cron job for that time. I'm unsure how to accomplish this.
Any other methods that may work? Are my concerns about node-cron not valid?
There are two approaches I can think of:
Keep track of the last timestamp you processed
Keep the "things to process" in a queue
Keep track of the last timestamp you processed
When you process items, you use the current timestamp as the cut-off point for your query. Something like:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
Now make sure to store this now somewhere (i.e. in your database) so that you can re-use it next time to retrieve the next batch of items:
var previous = ... previous value of now
var now = Date.now();
var query = ref.orderByChild("updateTime").startAt(previous).endAt(now);
With this you're only processing a single slice at a time. The only tricky bit is that somebody might insert a new node with an updateTime that you've already processed. If this is a concern for your use-case, you can prevent them from doing so with a validation rule on updateTime:
".validate": "newData.val() >= root.child('lastProcessed').val()"
As you add more items to the database, you will indeed be querying more items. So there is a scalability limit to this approach, but this approach should work well for anything up to a few hundreds of thousands of nodes (I haven't tested in a while so ymmv).
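A minimal in-memory sketch of the "last timestamp" idea, with a plain array standing in for the database; `createWindowProcessor` and `handle` are hypothetical names. As far as I know, startAt and endAt are both inclusive in the Realtime Database, so an item whose updateTime equals the stored cut-off would match two consecutive queries; the sketch uses an exclusive lower bound to avoid handling such an item twice, which is my assumption about the desired behavior.

```javascript
// Returns a function that processes only items in (lastProcessed, now].
function createWindowProcessor(handle, initialCutoff = 0) {
  let lastProcessed = initialCutoff; // would be persisted in a real setup
  return function processDue(items, now) {
    const due = items
      .filter((item) => item.updateTime > lastProcessed && item.updateTime <= now)
      .sort((a, b) => a.updateTime - b.updateTime);
    due.forEach(handle);
    lastProcessed = now; // next call starts where this one ended
    return due.length;
  };
}

// Example with made-up timestamps.
const sampleImages = [
  { src: 'a', updateTime: 100 },
  { src: 'b', updateTime: 200 },
  { src: 'c', updateTime: 300 },
];
const processImages = createWindowProcessor((item) => console.log('processing', item.src));
processImages(sampleImages, 200); // handles a and b
processImages(sampleImages, 300); // handles only c
```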
For a few previous questions on list size:
Firebase Performance: How many children per node?
Firebase Scalability Limit
How many records / rows / nodes is alot in firebase?
Keep the "things to process" in a queue
An alternative approach is to keep a queue of items that still need to be processed. The clients add the items that they want processed to the queue, with an updateTime of when they want them to be processed. Your server picks the items from the queue, performs the necessary updates, and removes each item from the queue:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
query.once("value").then(function(snapshot) {
snapshot.forEach(function(child) {
// TODO: process the child node
// remove the child node from the queue
child.ref.remove();
});
})
The difference with the earlier approach is that a queue's stable state is going to be empty (or at least quite small), so your queries will run against a much smaller list. That's also why you won't need to keep track of the last timestamp you processed: any item in the queue up to now is eligible for processing.
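The same queue logic can be sketched without Firebase, using an array as the queue; `drainDue` and `handle` are hypothetical names. It shows why no stored timestamp is needed: whatever is due gets handled and removed, and only future items stay behind.

```javascript
// Handle and remove every queued item whose updateTime is due; keep the rest.
// Mutates `queue` in place and returns the number of processed items.
function drainDue(queue, now, handle) {
  const remaining = [];
  let processed = 0;
  for (const item of queue) {
    if (item.updateTime <= now) {
      handle(item);
      processed += 1;
    } else {
      remaining.push(item); // not due yet, stays in the queue
    }
  }
  queue.length = 0;
  queue.push(...remaining);
  return processed;
}

// Example with made-up timestamps.
const sampleQueue = [
  { src: 'a', updateTime: 100 },
  { src: 'b', updateTime: 500 },
  { src: 'c', updateTime: 150 },
];
drainDue(sampleQueue, 200, (item) => console.log('processing', item.src));
console.log(sampleQueue.length); // 1
```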