Firestore Query performance issue on Firebase Cloud Functions - node.js

I am facing timeout issues on a Firebase HTTPS function, so I decided to profile each line of code and found that a single query takes about 10 seconds to complete.
let querySnapshot = await admin.firestore()
  .collection("enrollment")
  .get();
The enrollment collection has about 23k documents, totaling approximately 6MB.
To my understanding, since the HTTPS function runs on a stateless Cloud Functions server, it should not suffer from the query result size. Both Firestore and Cloud Functions are running in the same region (us-central). Yet 10 seconds is a long time for such a simple query that results in a small snapshot.
An interesting fact is that later in the code I write a new field back to those 23k documents using BulkWriter, and it takes less than 3 seconds to run bulkWriter.commit().
Another fact is that the https function is not returning any of the query results to the client, so there shouldn't be any "downloading" time affecting the function performance.
Why on earth does it take 3x longer to read values from a collection than writing to it? I always thought Firestore architecture was meant for apps with high reading rates rather than writing.
Is there anything you would propose to optimize this?

When you call get() on a collection, a query for every document in it is issued and the results are returned as document snapshots. These results are fetched sequentially within a single execution, i.e. the list is returned and parsed in order until all documents have been listed.
While the data may be small, are there any subcollections? This may add some additional latency as the API fetches and parses subcollections.
Updating the fields with a BulkWriter is more than 3x faster because BulkWriter operations are performed in parallel and queued based on Promises. This allows many more operations per second.
The best way to optimize listing all documents is summarised in this link, and Google’s recommendation follows the same guideline: use an index for faster queries, and use multiple readers that fetch the documents in parallel.
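The "multiple readers in parallel" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: readInParallel and fetchRange are hypothetical names, and fetchRange stands in for whatever range query you build (for example with orderBy(...).startAt(start).endBefore(end).get() on the collection). The split points are up to you and only help if they divide the documents fairly evenly.

```javascript
// Sketch: read a large collection as several key ranges fetched in parallel.
// `fetchRange(start, end)` is a hypothetical callback supplied by the caller;
// it is assumed to return a Promise of an array of documents whose key is
// >= start and < end.
async function readInParallel(boundaries, fetchRange) {
  // boundaries: sorted split points, e.g. ["a", "h", "p", "z"]
  const ranges = [];
  for (let i = 0; i < boundaries.length - 1; i++) {
    ranges.push({ start: boundaries[i], end: boundaries[i + 1] });
  }
  // Start every range read at once; Promise.all keeps the range order.
  const chunks = await Promise.all(
    ranges.map((r) => fetchRange(r.start, r.end))
  );
  return chunks.flat();
}
```

Whether this actually beats a single get() depends on the data and on how well the ranges are balanced, so it is worth measuring before adopting it.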

Related

How to optimize mongodb insert query in nodejs?

We are manipulating data and inserting it into MongoDB. A single insert takes 28 ms, and I have to insert twice per request. If I get 6000 requests at a time, I have to insert each record individually, and it takes a lot more time. How can I optimize this? Kindly help me with this.
var obj = new gnModel({
  id: data.EID,
  val: data.MID,
});
// Pass the document instance (obj), not the model class
let response = await insertIntoMongo(obj);
If it is not vital for the data to be stored immediately, you can implement some form of batching.
For example, you can have a service which queues operations and commits them to the database every X seconds. In the service itself, you can use mongo's Bulk API, and more specifically for insertion, Bulk.insert(). It lets you queue operations to be executed as a single query (or at least a minimal number of queries/round trips).
It would also be a good idea to serialize and store this operation log/cache somewhere, as server restart will wipe it out if it is stored entirely in memory. A possible solution is Redis as it can both persist data to disk and is also distributed thus enabling you to queue operations from different application instances.
You'll achieve even better performance if the operations are unrelated and not dependent on each other. In this case you can use db.collection.initializeUnorderedBulkOp(), which allows mongo to execute the operations in parallel instead of sequentially, and a single failed operation won't affect the execution of the rest of the set (contrary to OrderedBulkOp).
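A minimal sketch of the batching idea described above, assuming the actual write is injected as a callback. InsertBuffer and flushFn are hypothetical names; with the official MongoDB Node.js driver, flushFn could be something like (docs) => collection.insertMany(docs, { ordered: false }).

```javascript
// Sketch of an in-memory write buffer. `flushFn` is a hypothetical callback
// that performs the real bulk write and returns a Promise.
class InsertBuffer {
  constructor(flushFn, maxSize = 1000) {
    this.flushFn = flushFn;
    this.maxSize = maxSize;
    this.pending = [];
  }

  // Queue a document; flush automatically once the buffer is full.
  async add(doc) {
    this.pending.push(doc);
    if (this.pending.length >= this.maxSize) {
      await this.flush();
    }
  }

  // Send everything queued so far as one bulk call.
  async flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = []; // swap first so documents added during the write are kept
    await this.flushFn(batch);
  }
}
```

To get the "every X seconds" behaviour you would also call flush() from a timer. Note that in this sketch a failed flushFn call loses its batch, and anything still in memory is lost on restart, which is why persisting the queue (as suggested above) may be worth the extra effort.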

Cloud Functions Http Request return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase, so I'll try to be specific with my question.
I have a firebase-database with objects including a "score" field. I want the data to be retrieved based on that, and that can be done easily in client side.
The issue is that, if the database grows big, I'm worried that it will either take too long to return or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions that stores a cache of the top N objects and updates itself, via a listener, whenever the score of any object changes.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need to use this cache, and how to return them in json format via http?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason why is, first, that the firebase database does not contain a "child count" so I didn't find a way with my newbie javascript knowledge to implement that. Second, and most important, is that I'm pretty sure it won't scale up to millions, having at most 10K entries, and firebase has rules for sorted reading optimization. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication or request-method checks
exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  return ref.once('value')
    .then((snapshot) => res.status(200).send(JSON.stringify(snapshot.val())))
    .catch((err) => {
      // Without this, a failed read would leave the request hanging until timeout
      console.error(err);
      res.status(500).send('Failed to read levels');
    });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the list of items for the latest 10 upon every write, which is an O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
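As a rough illustration of keeping derived data up to date on each write, here is a sketch of both cases. pushLatest and updateAverage are hypothetical helper names, and the average shown is a cumulative running average that assumes you store the current average and the item count alongside it.

```javascript
// Keep a "latest N" list current on each write instead of re-querying.
// Returns a new array; the oldest entries are dropped once the list
// exceeds `limit`.
function pushLatest(latest, item, limit = 10) {
  const next = [...latest, item];
  return next.length > limit ? next.slice(next.length - limit) : next;
}

// Running (cumulative) average: only the stored average and count are
// needed, never the individual earlier values.
function updateAverage(state, value) {
  const count = state.count + 1;
  const average = state.average + (value - state.average) / count;
  return { average, count };
}
```

In a Cloud Function you would typically run this kind of update inside a transaction on the derived node, so that concurrent writes do not overwrite each other.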

DocumentDB: How to run a query without timing out

I am new to DocumentDB. I wrote a stored procedure that checks all records and updates them under certain circumstances.
Current scenario:
It processes 100 records at a time and updates them, but after a few rounds (taking 100 records at a time and updating them) it times out.
Expectation
Run the script on all the records without timing out.
The collection has close to a million records, so running the same script multiple times manually is not the approach I am looking for.
Can anyone please advise me how I can achieve that?
tl;dr; Keep calling the sproc with the query continuation token being passed back and forth.
A few thoughts:
No amount of provisioned RUs for a collection will let you process all million documents in one call to the sproc.
Sprocs run in isolation on a single replica. This means that they can be transactional but their use will have lower throughput than a regular query that can use all replicas to satisfy the request, so unless you need it to be in a sproc, I recommend using direct queries for reads that don't need to be transactional with writes. Even then, with a million documents, your queries will max out and you'll have to run the query again with a continuation token.
If you must use a sproc... As you are probably aware since you have done the 100 at a time thing, each query returns a continuation token. You can actually add that to the package that you send back from your sproc when it times out. Then you can pass that back into another call to the same sproc and write your sproc to pick up where you left off. The documentdb-utils library for node.js automatically re-calls the sproc until done as long as you follow this pattern for writing your sprocs. If you are using node.js, you could use that (but it has not yet been upgraded to support partitioned collections) or you could write the equivalent in whatever platform you are using.
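The "keep calling with the continuation token" pattern can be sketched like this. It is not the documentdb-utils implementation; runUntilDone and callSproc are hypothetical names, and the shape of the result ({ processed, continuation }) is an assumption about what your own sproc returns in its response body.

```javascript
// Sketch: keep calling a stored procedure until it reports no continuation.
// `callSproc(continuation)` is a hypothetical async function wrapping the
// client call; it is assumed to resolve to { processed, continuation },
// where `continuation` is null or undefined once the sproc has finished.
async function runUntilDone(callSproc, maxCalls = 100000) {
  let continuation = null;
  let total = 0;
  for (let calls = 0; calls < maxCalls; calls++) {
    const result = await callSproc(continuation);
    total += result.processed;
    continuation = result.continuation;
    if (!continuation) return total; // nothing left to process
  }
  throw new Error("Gave up after " + maxCalls + " calls");
}
```

The maxCalls guard is there only so that a sproc which never clears its token cannot loop forever.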

neo4j graph import slower when using transactional end point

I have an application which imports user profile and social data on to a graph. My app importer is a nodejs app. The first pass of my importer used node-neo4j and async cypher queries to import the data. I combined this with the Q promise library to string together thousands of queries.
My second pass was an attempt to use the transactional REST endpoint: /db/data/transaction/commit and a single JSON document containing 5000 transactions.
What I'm seeing is that the first approach completed in 15 seconds while the second approach (which I expect makes thousands fewer HTTP calls) actually takes 30 seconds to complete. I'm at a loss as to how the second approach could be twice as slow.
Can anyone shed any light on this?
Transactional commits trade memory and ordered operations for performance.
If you're looking for speedy imports, I recommend LOAD CSV.

CloudTable.ExecuteQuery : how to get number of transactions

AFAIK the ExecuteQuery handles segmented queries internally. And every request (=call to storage) for a segment counts as a transaction to the storage (I mean billing transaction - http://www.windowsazure.com/en-us/pricing/details/storage/). Right?
Is there a way to find out how many segments a query is split into? If I run a query and get 5000 items, I can assume that my query was split into 5 segments (due to the limit of 1000 items per segment). But in the case of a complex query there is also the timeout of 5 seconds per call.
I don't believe there's a way to get at that in the API. You could set up an HTTPS proxy to log the requests, if you just want to check it in development.
If it's really important that you know, use the BeginExecuteSegmented and EndExecuteSegmented calls instead. Your code will get a callback for each segment, so you can easily track how many calls there are.

Resources