Firebase Functions Batch Processing / Time Limit - node.js

I am using Firebase Functions and I have a pubsub function that runs every night.
This function is responsible for doing some processing on every user's account (such as preparing some data for them for the next day and sending them an email).
The issue I'm having is that there is a time limit on how long a function can take to run, which for event-driven functions, is 10 minutes.
https://firebase.google.com/docs/functions/quotas#:~:text=Time%20Limits,-Quota&text=60%20minutes%20for%20HTTP%20functions,minutes%20for%20event%2Ddriven%20functions.
Now that my number of users has scaled significantly, this limit is no longer sufficient to complete the work for all users.
To help with this, I created a firestore.onWrite event function that fires when a certain document is written. Now, when my pubsub function runs, it writes to a document in firestore, which then triggers my onWrite function to run. This seems to have allowed me to process more users than previously, However, 1) this just feels wrong, and 2) it's still not sufficient as there are rate-limits to writing documents that I've run into as well and I just can't squeeze all my users into the time limit allowed.
What is the correct approach for doing nightly batches (on user accounts), preferably within the Firebase ecosystem?

As you correctly pointed out and as it is mentioned in the documentation, the maximum amount of time a function can run before being forcibly terminated is 10 mins for an event-driven function.
However, in order to circumvent that limitation I found this SO post of another person that had a similar issue to you.Most of the recommendations are batching it under 500 and then committing.
Here, have a look at the solution here.
Hope this helps.

Related

How to handle multiple API Requests

I am working with the Google Admin SDK to create Google Groups for my organization. I can't add members to the group when creating the group, ideally, when I create a new group I'd like to add roughly 60 members. In addition, the ability to add members after the group is created in bulk (a batch request) was deprecated August 2020. Right now, after I create a group I need to make a web request to the API to add each member of the group individually (which will be about 60 members).
I am using node.js and express, is there a good way to handle 60 web requests to an api? I don't know how taxing this will be on my server. If anyone has any resources to share where I can learn about the impact this would have on a nodejs server that would be great.
Fortunately, these groups aren't created often, maybe 15 a month.
One idea I have is to offload the work to something like a cloud function so my node server makes one request to the cloud function, then the cloud function makes all the additional requests to add members to the Group. I'm not 100% sure if this is necessary and I'm curious on other approaches.
Limits and Quotas
Note that adding group members may take 10 minutes to propagate.
The rate limit for the Directory API is 3000 queries per 100 seconds per IP address. This works out to around 30 per second. 60 requests is not a large amount of requests, but if you try to send them all in a few milliseconds the system may extrapolate the rate and deem it over the limit, I wouldn't think so, though probably best to test it on your end with your system and connection etc.
Exponential Backoff
If you do need to make many requests this is the method Google recommends. It involves repeating the request if it fails and then exponentially increasing the amount of time to wait until it reaches 16 seconds. You can always implement a longer wait to retry. Its only 60 requests after all.
Batch Requests
The previously mentioned methods should work no issue for you, since there are only 60 requests to make, it won't put any "stress" on the system as such. That said, the most performant way to handle many requests to the Directory API is to use a batch request. This allows you to bundle up all your member requests into one large batch, of up to 1000 calls. This will also give you a nice cushion in case you need to increase your requests in future!
EDIT - I am sorry, I missed that you mentioned that Batching is deprecated. Only global batching is deprecated. If you send a batch request to a specific API, batching will still be supported. What you can no longer do is send a single batch request to different APIs, like Directory and Sheets in one.
References
Limits and Quotas
Exponential Backoff
Batch Requests

Process long arrays in Cloud Functions

I'am developing an application that makes users able to broadcast videos. As many social network do, users need to receive a notification when someone goes live. To do so I'am using Cloud Functions and i pass to the functions the array of the users that must receive the notification; for every user of the array I need to extract the FCM TOKEN from the server and then send the notification.
For arrays of 10 / 20 Users the functions doesn't take so long, but for 150/300 users sometimes I get timeout or a very slow execution.
So my question is: Is it possible to divide the array in groups of 20/30 users and process many arrays at same time??
Thanks
There is 2 way to answer this
From a development point of view, some languages are easier for allowing the concurrent processing (Go is very handy for this). So, like this, because you spend a lot of time in API call (FCM), it's a first solution to perform several calls concurrently
From an architecture point of view, PubSub and/or Cloud Task are well designed for this
Your first function only creates chunk of message to send and posts them to Cloud Task or PubSub
Your second function receives the chunks and sends the messages. The chunks are processed in parallel on several functions.

Calling external API only when new data is available

I am serving my users with data fetched from an external API. Now, I don't know when this API will have new data, how would be the best approach to do that using Node, for example?
I have tried setInterval's and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint to check for new data every minute, but it could have new data every five minutes or more.
The thing is, this external API isn't ran by me. Would the only way to check for updates hitting it every minute? Is there any module that can do that in Node or any approach that fits better?
Use case 1 : Call a weather API for every city of the country and just save data to my db when it is going to rain in a given city.
Use case 2 : Send notification to the user when a given Philips Hue lamp is turned on at the time it is turned on without having to hit the endpoint to check if it is on or not.
I appreciate the time to discuss this.
If this external API has no means of notifying you when there's new data, then the only thing you can do is to "poll" it to check for new data.
You will have to decide what an "efficient design" for polling is in your specific application and given the type of data and the needs of the client (what is an acceptable latency for new data).
You also need to be sure that your service is not violating any terms of service with your polling scheme or running afoul of rate limiting that may deny you access to the server if you use it "too much".
Would the only way to check for updates hitting it every minute?
Unless the API offers some notification feature, there is no other scheme other than polling at some interval. Polling every minute is fairly quick. Do your clients really need information that is less than a minute old? Or would it really make no difference if the information was as much as 5 minutes old.
For example, in your example of weather, a client wouldn't really need temperature updates more often than probably every 10-15 minutes.
Is there any module that can do that in Node or any approach that fits better?
No. Not really. You'll probably just use some sort of timer (either repeated setTimeout() or setInterval() in a node.js app to repeatedly carry out your API operations.
Use case: Call a weather API for every city of the country and just save data to my db when it is going to rain in a given city.
Trying to pre-save every possible piece of data from an external API is probably a losing proposition. You're essentially trying to "scrape" all the data from the external API. That is likely against the terms of service and will likely also run afoul of rate limits. And, it's just not very practical.
Instead, you will probably want to fetch data upon demand (when a client requests data for Phoenix, then, and only then, do you start collecting data for Phoenix) and then once a demand for a certain type of data (temperatures in a particular city) is established, then you might want to pre-cache that data more regularly so you can notify clients of changes. If, after awhile, no clients are asking for data from Phoenix, you stop requesting updates for Phoenix any more until a client establishes demand again.
I have tried setInterval's and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint to check for new data every minute, but it could have new data every five minutes or more.
Making a remote network request is not a CPU intensive operation, even if you're doing it every minute. node.js uses non-blocking networking so most of the time during a network request, node.js isn't doing anything and isn't using the CPU at all. The only time the CPU would be briefly used is when you first send the API request and then when you receive back the result from the API call and need to process it.
Whether you really need to "poll" every minute depends upon the data and the needs of the client. I'd ask yourself if your app will work just fine if you check for new data every 5 minutes.
The method I would use to update would be contained outside of the code in a scheduled batch/powershell/bash file. In windows you can schedule tasks based upon time of day or duration since last run, so what you could do is run a simple command that will kill your application for five minutes, run npm update, and then restart your application before closing the shell.
That way you're staying out of your API and keeping code to a minimum, and if your code is inside that Node package in the update, it'll be there and ready once you make serious application changes or you need to take the server down for maintenance and updates to the low-level code.
This is a light-weight solution for you and it's a method I've used once or twice at my workplace. There are lots of options out there, and if this isn't what you're looking for I can keep looking out for you.

What is the best way to keep local copy of Firebase Database on node.js

I have an app where I need to check people's posts constantly. I am trying to make sure that the -server- handles more than 100,000 posts. I tried to explain the program and specify the issues I am worried about by numbers.
I am running a simple node.js program on my terminal that runs as firebase admin controlling the Firebase Database. The program has no connectivity with clients(users), it just keeps the database locally to check users' posts every 2-3 seconds. I am keeping the posts in local hash variables by using on('child_added') to simply push the post to a posts hash and so on for on('child_removed') and on('child_changed').
Are these functions able to handle more than 5 requests per second?
Is this the proper way of keeping data locally for faster processing(and not abusing firebase limits)? I need to check every post on the platform every 2-3 seconds, so I am trying to keep a local copy of the -posts data.
That local copy of the posts are looped through every 2-3 seconds.
If there are thousands of posts, will a simple array variable handle that load?
Second part of the program:
I run a for loop to loop through the posts in a function. I run the function every 2-3 seconds using setInterval(). The program needs not only to check new added posts but it constantly needs to check all posts on the database.
If(specific condition for a post) => the program changes the state of the post
.on(child_changed) function => sends an API request to a website after that state change
Can this function run asynchronously ? When it is called, the function should not wait for the previous call to finish because the old call is sending an API request and it might not complete fast. How can I make sure that .on(child_changed) doesn't miss a single change on the -posts data?
Listen for Value Events documentation shows how to observe changes, namely one uses the .on method.
In terms of backing up your Realtime Database, you simply export the data manually, or if you have the paid plan you can automate it.
I don't understand why you would want to recreate the wheel, so to speak, and have your server ping firebase for updates. Simply use firebase observers.

SQS: Know remaining jobs

I'm creating an app that uses a JobQueue using Amazon SQS.
Every time a user logs in, I create a bunch of jobs for that specific user, and I want him to wait until all his jobs have been processed before taking the user to a specific screen.
My problem is that I don't know how to query the queue to see if there are still pending jobs for a specific user, or how is the correct way to implement such solution.
Everything regarding the queue (Job creation and processing is working as expected). But I am missing that final step.
Just for the record:
In my previous implementation I was using Redis + Kue and I had created a key with the user Id and the job count, every time a job was added that job count was incremented, and every time a job finished or failed I decremented that count. But now I want to move away from Redi + Kue and I am not sure how to implement this step.
Amazon SQS is not the ideal tool for the scenario you describe. A queueing system is normally used in a "Send and Forget" situation, where the sending system doesn't remain interested in later processing.
You could investigate Amazon Simple Workflow (SWF), which allows work to be monitored as it goes through several processes. Your existing code could mostly be re-used, just with the SWF framework added. Or even power it from Lambda, since you are already using node.js.

Resources