Slow Firebase Firestore cold starts in Cloud Functions - node.js

On a cold start (after deploying, or after roughly 3 hours of inactivity), the function that requests a document from Firestore takes far longer than it does when invoked repeatedly.
Cold Start:
Function execution took 4593 ms, finished with status code: 200
Rapid fire (me sending using the same function over and over):
Function execution took 437 ms, finished with status code: 200
My code for getting the documents is quite simple:
function getWorkspaceDocument(teamSpaceId) {
    return new Promise((resolve, reject) => {
        var teamRef = db.instance.collection('teams').doc(teamSpaceId);
        teamRef.get().then(doc => {
            if (doc.exists) {
                resolve(doc.data());
                return;
            }
            else {
                reject(new Error("Document can't be found"));
                return;
            }
        }).catch(error => {
            reject(new Error("Document can't be found"));
        });
    });
}
I'm building a Slack bot, and the slow responses from Firestore cause timeouts against Slack's API. Is there a way on Firebase to prevent cold starts from happening, or to keep the function instance warm?

If Cloud Functions needs to start a new instance, your cold start time looks normal. This is one drawback of serverless functions.
I think there is a problem with your implementation. Could you show more details?
Here is a nice little video about this topic:
https://youtu.be/v3eG9xpzNXM

firebaser here
We actually just released a new preferRest API that should considerably improve the cold start times for Cloud Functions that use Firestore. The documentation for it is not very complete, but you can enable the feature with:
import { initializeApp } from 'firebase-admin/app';
import { initializeFirestore } from 'firebase-admin/firestore';

const app = initializeApp();
const firestore = initializeFirestore(app, { preferRest: true }); // 👈

firestore.collection(...);
With preferRest: true the Firestore Admin SDK uses the REST transport layer by default, and it then only loads and uses the gRPC libraries when it encounters an operation that needs them.
Since the gRPC libraries are quite big and the only operation that requires gRPC is creating a snapshot listener, this should reduce the cold start times for most Cloud Functions implementations significantly.
I haven't had a chance to test this myself yet and there are still some known issues, so YMMV and I'd love to hear specifics on what performance change you see from this.
Also see:
the written release note about this option
the Release Notes video where this is mentioned

Another thing I would suggest checking is the amount of memory allocated to the function. Each level you select increases not only the RAM but also the CPU frequency (and the cost; be careful and don't forget about the pricing calculator!). There is a direct relationship between the package size of your function and its cold start (source: https://mikhail.io/serverless/coldstarts/gcp/).
I can see that you are using the Firestore Admin package, which is not considered lightweight (source: https://github.com/firebase/firebase-admin-node/issues/238). Thus, a 128MB configuration might not be enough.
For our project, increasing the RAM from 128MB to 512MB cut the cold boot roughly 10x, from 20 seconds to 2.5 seconds on average. Be sure not to overlook this if you have several dependencies (7 in our case).
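For reference, here is a minimal sketch (my own, not from the original question) of raising a function's memory with the firebase-functions v1 SDK; getWorkspaceDocument is the helper from the question above, and the exact values are only illustrative:

const functions = require("firebase-functions");

// More memory also means a faster CPU on the instance, which tends to shorten
// cold starts when heavy dependencies like firebase-admin have to be loaded.
exports.getWorkspace = functions
  .runWith({ memory: "512MB", timeoutSeconds: 60 })
  .https.onRequest(async (req, res) => {
    const doc = await getWorkspaceDocument(req.query.teamSpaceId);
    res.json(doc);
  });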

Related

Google Pub/Sub with distributed subscribers in Node.js

We are attempting to migrate a message processing app from Kafka to Google Pub/Sub and it's just not working as expected.
We are running in Kubernetes (Google Cloud) where there may be multiple pods processing messages on the same subscription. Topics and subscriptions are all created using terraform and are more or less permanent. They are not created/destroyed on the fly by the application.
In our development environment, where message throughput is rather low, everything works just fine. But when we scale up to production levels, everything seems to fall apart. We get big backlogs of unacked messages, and yet some pods are not receiving any messages at all. And then, all of a sudden, the backlog will just go away, but then climb again.
We are using the Node.js client library provided by Google: @google-cloud/pubsub 3.1.0
Each instance of the application subscribes to the same named subscription, and according to the documentation, messages should be distributed to each subscriber. But that is not happening. Some pods will be consuming messages rapidly, while others sit idle.
Every message is processed in a try/catch block and we are not observing any errors being thrown. So, as far as we know, every received message is getting acked.
I am suspicious that, as pods are terminated by autoscaling or updated deployments, we are not properly closing subscriptions, but there are no examples addressing a distributed environment and I have not found any documentation that specifically addresses how to properly manage these resources. It is also worth mentioning that the app has multiple subscriptions to different topics.
When a pod shuts down, what actions should be taken on the Subscription object and the PubSub client object? Maybe that's not even the issue, but it seems like a reasonable place to start.
When we start a subscription we do something like this:
private exampleSubscribe(): Subscription {
  // one suggestion for having multiple subscriptions in the same app
  // was to use separate clients for each
  const pubSubClient = new PubSub({
    // use a regional endpoint for message ordering
    apiEndpoint: 'us-central1-pubsub.googleapis.com:443',
  });
  pubSubClient.projectId = 'my-project-id';

  const sub = pubSubClient.subscription('my-subscription-name', {
    // have tried various values for maxMessages from 5 to the default of 1000
    flowControl: { maxMessages: 250, allowExcessMessages: false },
    ackDeadline: 30,
  });

  sub.on('message', async (message) => {
    await this.exampleMessageProcessing(message);
  });

  return sub;
}
private async exampleMessageProcessing(message: Message): Promise<void> {
  try {
    // do some cool stuff
  } catch (error) {
    // log the error
  } finally {
    message.ack();
  }
}
Upon termination of a pod, we do this:
private async exampleCloseSub(sub: Subscription) {
  try {
    sub.removeAllListeners('message');
    await sub.close();
    // note that we do nothing with the PubSub
    // client object -- should it also be closed?
  } catch (error) {
    // ignore error, we are shutting down
  }
}
When running with Kafka, we can easily keep up with the message pace with usually no more than 2 pods. So I know that we are not running into issues of it simply taking too long to process each message.
Why are messages being left unacked? Why are pods not receiving messages when there is clearly a large backlog? What is the correct way to shut down one subscriber on a shared subscription?
It turns out that the issue was an improper implementation of message ordering.
The official docs for message ordering in Pub/Sub are rather brief:
https://cloud.google.com/pubsub/docs/ordering
Not much there regarding how to implement an ordering key or the implications of message ordering on horizontal scaling.
Though they do link to some external resources, one of which is this blog post:
https://medium.com/google-cloud/google-cloud-pub-sub-ordered-delivery-1e4181f60bc8
In our case, we did not have enough distinct ordering keys to allow for proper distribution of messages across subscribers/pods.
So this was definitely an RTFM situation, or more accurately: Read The Fine Blog Post Referred To By The Manual. I would have much preferred that the important details were actually in the official documentation. Is that too much to ask for?
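For anyone landing here later, this is roughly the shape of an ordered publish with @google-cloud/pubsub v3. It is my own sketch rather than code from our app, and the topic name and key are made up; the point is that the ordering key needs enough distinct values, because messages sharing a key are funneled to a single subscriber stream:

const { PubSub } = require("@google-cloud/pubsub");

const pubsub = new PubSub({
  // ordering requires a regional endpoint, matching the subscriber config above
  apiEndpoint: "us-central1-pubsub.googleapis.com:443",
});

async function publishOrdered(entityId, payload) {
  // messageOrdering must be enabled on the publisher side
  const topic = pubsub.topic("my-topic", { messageOrdering: true });
  await topic.publishMessage({
    data: Buffer.from(JSON.stringify(payload)),
    // Use a high-cardinality key (e.g. an entity id) rather than a handful of
    // constant values, so messages spread across pods while staying ordered
    // per entity.
    orderingKey: String(entityId),
  });
}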

How to avoid memory leak when using pub sub to call function?

I'm stuck on a performance issue when using Pub/Sub to trigger a function.
// this is called from index.ts
export function downloadService() {
  // References an existing subscription
  const subscription = pubsub.subscription("DOWNLOAD-sub");

  // Create an event handler to handle messages
  // let messageCount = 0;
  const messageHandler = async (message: any) => {
    console.log(`Received message ${message.id}:`);
    console.log(`\tData: ${message.data}`);
    console.log(`\tAttributes: ${message.attributes.type}`);

    // "Ack" (acknowledge receipt of) the message
    message.ack();

    await exportExcel(message); // my function
    // messageCount += 1;
  };

  // Listen for new messages until timeout is hit
  subscription.on("message", messageHandler);
}
async function exportExcel(message: any) {
  // get data from database
  const movies = await Sales.findAll({
    attributes: [
      "SALES_STORE",
      "SALES_CTRNO",
      "SALES_TRANSNO",
      "SALES_STATUS",
    ],
    raw: true,
  });
  // ... processing to excel // 800k rows
  // ... bucket.upload to gcs
}
The function above works fine if I trigger only ONE Pub/Sub message.
However, it hits a memory leak or a database connection timeout if I trigger many Pub/Sub messages in a short period of time.
The problem I found is that the first message has not finished processing before subsequent Pub/Sub messages invoke the function again, so several exports run at the same time.
I have no idea how to resolve this, but I was thinking that implementing a queue worker or Google Cloud Tasks might solve the problem?
As mentioned by @chovy in the comments, the exportExcel calls need to be queued up, since the function's execution is not keeping up with the rate of invocation. One module that can be used to queue function calls is async. Please note that the async module is not officially supported by Google.
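As a rough illustration (an assumption on my part, not the asker's code), a queue with concurrency 1 built on the async module could wrap the existing exportExcel function so only one export runs at a time:

const async = require("async");

// Worker processes one message at a time; exportExcel is the function from the question.
const exportQueue = async.queue(async (message) => {
  await exportExcel(message);
}, 1);

const messageHandler = (message) => {
  message.ack();
  // Queue the export instead of starting it immediately.
  exportQueue.push(message, (err) => {
    if (err) console.error(`Export failed for message ${message.id}:`, err);
  });
};

subscription.on("message", messageHandler);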
As an alternative, you can use flow control features on the subscriber side. Data pipelines often receive sporadic spikes in published traffic, which can overwhelm subscribers trying to catch up. The usual response to high published throughput on a subscription is to dynamically autoscale subscriber resources to consume more messages. However, this can incur unwanted costs (for instance, more VMs), which leads to additional capacity planning. Flow control features on the subscriber side let the subscriber regulate the rate at which messages are ingested. Please refer to this blog for more information on flow control features.
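Here is a sketch of that subscriber-side flow control (adapted from the question's code, not tested against it); limiting the client to a single outstanding message means a new export only starts after the previous one is acked:

const { PubSub } = require("@google-cloud/pubsub");

const pubsub = new PubSub();
const subscription = pubsub.subscription("DOWNLOAD-sub", {
  flowControl: {
    maxMessages: 1,             // deliver at most one unacked message at a time
    allowExcessMessages: false, // don't buffer extra messages in the client
  },
});

subscription.on("message", async (message) => {
  try {
    await exportExcel(message); // the long-running export from the question
    message.ack();
  } catch (err) {
    console.error(err);
    message.nack(); // let Pub/Sub redeliver on failure
  }
});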

Firebase Dynamic Pages using Cloud Functions with Firestore are slow

We have dynamic pages served by Firebase Cloud Functions, but the TTFB on these pages is very slow, at 900ms - 2s. At first we assumed it was a cold start issue, but even with consistent traffic it remains slow, with a TTFB of 700ms - 1.2s.
This is a bit problematic for our project since it is organic traffic dependent and Google Pagespeed would need a server response of less than 200ms.
Anyway, we tried to check what might be causing the issue and pinpointed it to Firestore: when a Cloud Function accesses Firestore, we noticed there are some delays. This is a basic sample of how we implement a Cloud Function with Firestore:
dynamicPages.get('/ph/test/:id', (req, res) => {
  var globalStartTime = Date.now();
  var period = [];
  db.collection("CollectionTest")
    .get()
    .then((querySnapshot) => {
      period.push(Date.now() - globalStartTime);
      console.log('1', period);
      return db.collection("CollectionTest")
        .get();
    })
    .then((querySnapshot) => {
      period.push(Date.now() - globalStartTime);
      console.log('2', period);
      res.status(200)
        .send('Period: ' + JSON.stringify(period));
      return true;
    })
    .catch((error) => {
      console.log(error);
      res.end();
      return false;
    });
});
This is running on Firebase + Cloud Functions + NodeJS
CollectionTest is very small with only 100 documents inside, with each document having the following fields:
directorName: (string)
directorProfileUrl: (string)
duration: (string)
genre: (array)
posterUrl: (string)
rating: (string)
releaseDate: (string)
status: (int)
synopsis: (string)
title: (string)
trailerId: (string)
urlId: (string)
With this test, we would get the following results:
[467,762] 1.52s
[203,315] 1.09s
[203,502] 1.15s
[191,297] 1.00s
[206,319] 1.03s
[161,267] 1.03s
[115,222] 843ms
[192,301] 940ms
[201,308] 945ms
[208,312] 950ms
This data is [Firestore Call 1 Execution Time, Firestore Call 2 Execution Time] TTFB
Looking at the results, there are signs that the TTFB gets lower over time; maybe that is once the Cloud Function has warmed up? Even so, Firestore eats up 200-300ms inside the Cloud Function based on our second Firestore call, and even if Firestore took less time to execute, the TTFB would still be 600-800ms, but that is a different story.
Anyway, can anyone help how we can improve Firestore performance in our Cloud Functions (or if possible, the TTFB performance)? Maybe we are doing something obviously wrong that we don't know about?
I will try to help, though I may lack a bit of context about what you load before returning dynamicPages. Here are some clues:
First of all, the obvious part (I have to point it out anyway):
1 - Take care how you measure your TTFB:
Measuring TTFB remotely means you're also measuring the network latency at the same time, which obscures the thing TTFB is actually measuring: how fast the web server is able to respond to a request.
2 - And from Google Developers documentation about Understanding Resource Timing (here):
[...]. Either:
Bad network conditions between client and server, or
A slowly responding server application

To address a high TTFB, first cut out as much network as possible. Ideally, host the application locally and see if there is still a big TTFB. If there is, then the application needs to be optimized for response speed. This could mean optimizing database queries, implementing a cache for certain portions of content, or modifying your web server configuration. There are many reasons a backend can be slow. You will need to do research into your software and figure out what is not meeting your performance budget.

If the TTFB is low locally then the networks between your client and the server are the problem. The network traversal could be hindered by any number of things. There are a lot of points between clients and servers and each one has its own connection limitations and could cause a problem. The simplest method to test reducing this is to put your application on another host and see if the TTFB improves.
Not so obvious ones:
You can take a look at the official Google documentation regarding Cloud Functions Performance here: https://cloud.google.com/functions/docs/bestpractices/tips
Did you require some files before?
According to this answer to "Firebase cloud functions is very slow":
Looks like a lot of these problems can be solved using the hidden variable process.env.FUNCTION_NAME, as seen here: https://github.com/firebase/functions-samples/issues/170#issuecomment-323375462
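A sketch of that lazy-initialization idea (adapted from the linked comment, with illustrative function names of my own) looks like this; heavy modules are only initialized when the currently running function actually needs them:

const functions = require('firebase-functions');
const admin = require('firebase-admin');

let db;
// FUNCTION_NAME is set per deployed function on the older Node runtimes this
// answer refers to, so other functions in the project skip the Firestore setup cost.
if (!process.env.FUNCTION_NAME || process.env.FUNCTION_NAME === 'dynamicPages') {
  admin.initializeApp();
  db = admin.firestore();
}

exports.dynamicPages = functions.https.onRequest(async (req, res) => {
  const snapshot = await db.collection('CollectionTest').get();
  res.status(200).send(`Documents: ${snapshot.size}`);
});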
Are these dynamic pages being accessed by a guest user or a logged-in user? The first request may have to sort out authentication details, which is known to be slower...
If none of this works, I would look at the common performance issues like DB connections (here: Optimize Database Performance), improving server configuration, caching everything you can, and taking care of possible redirections in your app...
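On the caching point, one option worth testing (an assumption on my side, not something from the question) is letting the Firebase Hosting CDN cache the rendered page via Cache-Control, assuming the function is served behind a Hosting rewrite, so repeat requests never reach the function or Firestore at all:

dynamicPages.get('/ph/test/:id', async (req, res) => {
  const snapshot = await db.collection('CollectionTest').get();
  // Browsers cache for 5 minutes, the Hosting CDN for 10 minutes (s-maxage).
  res.set('Cache-Control', 'public, max-age=300, s-maxage=600');
  res.status(200).send('Documents: ' + snapshot.size);
});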
Finally, reading around the Internet, there are a lot of threads about your problem (low performance on simple Cloud Functions), like this one: https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2374 and on S.O.: https://stackoverflow.com/search?q=%5Bgoogle-cloud-functions%5D+slow
With comments like:
since when using cloud functions, the penalty is incurred on each http
invocation the overhead is still very high (i.e. 0.8s per HTTP call).
or:
Bear in mind that both Cloud Functions and Cloud Firestore are both in
beta and provide no guarantees for performance. I'm sure if you
compare performance with Realtime Database, you will see better
numbers.
Maybe it is still an issue.
Hope it helps!

Firebase Functions - ERROR, but no Event Message in Console

I have written a function on Firebase that downloads an image (base64) from Firebase Storage and sends it as the response to the user:
const functions = require('firebase-functions');
import os from 'os';
import path from 'path';
const storage = require('firebase-admin').storage().bucket();

export default functions.https.onRequest((req, res) => {
  const name = req.query.name;
  let destination = path.join(os.tmpdir(), 'image-randomNumber');
  return storage.file('postPictures/' + name).download({
    destination
  }).then(() => {
    res.set({
      'Content-Type': 'image/jpeg'
    });
    return res.status(200).sendFile(destination);
  });
});
My client calls that function multiple times, one after another (in series), to load a range of images for display, about 20 of them, averaging 4KB each.
After 10 or so pictures have loaded (the exact number varies), all further pictures fail. The reason is that my function does not respond correctly, and the Firebase console shows me that my function threw an error:
The screenshot from the console shows that:
A request to the function (called "PostPictureView") succeeds
Afterwards, three requests to the controller fail
In the end, after executing a new request to the "UserLogin" function, that one fails as well.
The response given to the client is the default "Error: Could not handle request". After waiting a few seconds, all requests get handled again as they are supposed to be.
My best guesses:
The project is on free tier, maybe google is throttling something? (I did not hit any limits afaik)
Is there a limit of messages the google firebase console can handle?
Could the tmpdir of the Functions app be running low on space? I never delete the temporary files, but I would expect that Google either deletes them automatically or warns me in some other way that space is running low.
Does someone know an alternative way to receive the error messages, or has experienced similar issues? (As Firebase Functions is still in Beta, it could also be an error from google)
Btw: Downloading the image from the client (Android app, React Native) directly is not possible, because I will use the function to check access permissions later. The problem is reproducible for me.
In Cloud Functions, the /tmp directory is backed by memory. So, every file you download there is effectively taking up memory on the server instance that ran the function.
Cloud Functions may reuse server instances for repeated calls to the same function. This means your function is downloading another file (to that same instance) with each invocation. Since the names of the files are different each time, you are accumulating files in /tmp that each occupy memory.
At some point, this server instance is going to run out of memory with all these files in /tmp. This is bad.
It's a best practice to always clean up files after you're done with them. Better yet, if you can stream the file content from Cloud Storage to the client, you'll use even less memory (and be billed even less for the memory-hours you use).
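As a rough sketch of that streaming approach (assuming the same storage bucket handle as in the question, and untested), the file never touches /tmp at all:

export default functions.https.onRequest((req, res) => {
  const name = req.query.name;
  res.set('Content-Type', 'image/jpeg');
  storage
    .file('postPictures/' + name)
    .createReadStream()              // stream straight from Cloud Storage
    .on('error', (err) => {
      console.error(err);
      res.status(500).send('Could not read file');
    })
    .pipe(res);                      // ...into the HTTP response
});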
After some more research, I've found the solution: the Firebase console does not seem to show all error information.
For detailed information on your functions, and errors that may be omitted in the Firebase console, check the Cloud Functions section of the Google Cloud console.
There I saw: the memory usage (as suggested by @Doug Stevenson) never went over 80MB (the limit being 256MB) and never shut the server down. Moreover, there is a DNS resolution limit on the free tier that my application hit.
The documentation points to a limit of 40,000 DNS resolutions per 100 seconds. In my case this limit was never hit (Firebase counts roughly 8,000 executions in total), but it seems there is a lower, undocumented limit for the free tier. After upgrading my account (I started the trial that GCP offers, so I'm actually not paying anything) and linking the project to the billing account, everything works perfectly.

Umbraco 7.6.0 - Site becomes unresponsive for several minutes every day

We've been having a problem for several months where the site becomes completely unresponsive for 5-15 minutes every day. We have added a ton of request logging, enabled DEBUG logging, and have finally found a pattern: Approximately 2 minutes prior to the outages (in every single log file I've looked at, going back to the beginning), the following lines appear:
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Timer: release.
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Run now (sync).
From what I gather this is the process that rebuilds the umbraco.config, correct?
We have ~40,000 nodes, so I can't imagine this would be the quickest process to complete, however the strange thing is that the CPU and Memory on the Azure Web App do not spike during these outages. This would seem to point to the fact that the disk I/O is the bottleneck.
This raises a few questions:
Is there a way to schedule this task so that it only runs during off-peak hours?
Are there performance improvements in the newer versions (we're on 7.6.0) that might improve this functionality?
Are there any other suggestions to help correct this behavior?
Hosting environment:
Azure App Service B2 (Basic)
SQL Azure Standard (20 DTUs) - DTU usage peaks at 20%, so I don't think there's anything there. Just noting for completeness
Azure Storage for media storage
Azure CDN for media requests
Thank you so much in advance.
Update 10/4/2017
If it helps, it appears that these particular log entries correspond with the first publish of the day.
I don't feel like 40,000 nodes is too much for Umbraco, but if you want to schedule republishes, you can do this:
You can programmatically call a cache refresh using:
ApplicationContext.Current.Services.ContentService.RePublishAll();
(Umbraco source)
You could create an API controller which you could call periodically by a URL. The controller would probably look something like:
public class CacheController : UmbracoApiController
{
    [HttpGet]
    public HttpResponseMessage Republish(string pass)
    {
        if (pass != "passcode")
        {
            return Request.CreateResponse(HttpStatusCode.Unauthorized, new
            {
                success = false,
                message = "Access denied."
            });
        }

        var result = Services.ContentService.RePublishAll();

        if (result)
        {
            return Request.CreateResponse(HttpStatusCode.OK, new
            {
                success = true,
                message = "Republished"
            });
        }

        return Request.CreateResponse(HttpStatusCode.InternalServerError, new
        {
            success = false,
            message = "An error occurred"
        });
    }
}
You could then periodically ping this URL (note the query parameter matches the pass argument of the controller):
/umbraco/api/cache/republish?pass=passcode
I have a blog post on how you can read on how to schedule events like these to occur. I recommend just using the Windows Task Scheduler to ping the URL: https://harveywilliams.net/blog/better-task-scheduling-in-umbraco#windows-task-scheduler

Resources