Cloud Firestore big data error - Deadline Exceeded [duplicate] - node.js

I would like to load a collection that is ~30k records, i.e. load it via:
const db = admin.firestore();
let documentsArray: Array<{}> = [];

db.collection(collection)
  .get()
  .then(snap => {
    snap.forEach(doc => {
      documentsArray.push(doc);
    });
  })
  .catch(err => console.log(err));
This always throws a Deadline Exceeded error. I have searched for some sort of mechanism that would allow me to paginate through it, but I find it hard to believe that I can't query a collection of this modest size in one go.
I thought my rather slow machine might be the reason I was hitting the limit, but then I deployed a simple Express app that does the fetching to App Engine and still had no luck.
Alternatively, I could export the collection with gcloud beta firestore export, but that does not produce JSON data.

I'm not sure about Firestore, but on Datastore I was never able to fetch that much data in one shot; I'd always have to fetch pages of about 1,000 records at a time and build the result up in memory before processing it. You said:
I have searched for some sort of mechanism that would allow me to paginate through
Perhaps you missed this page
https://cloud.google.com/firestore/docs/query-data/query-cursors
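A minimal sketch of cursor-based pagination along those lines, assuming the same admin.firestore() setup as the question (the 1000-per-page size and ordering by document ID are illustrative choices, not requirements):

const db = admin.firestore();

// Fetch the whole collection in pages, using the last document of each
// page as the start-after cursor for the next query.
async function fetchAll(collection) {
  const documentsArray = [];
  let lastDoc = null;
  while (true) {
    let query = db.collection(collection)
      .orderBy(admin.firestore.FieldPath.documentId())
      .limit(1000);
    if (lastDoc) {
      query = query.startAfter(lastDoc);
    }
    const snap = await query.get();
    if (snap.empty) {
      break;
    }
    snap.forEach(doc => documentsArray.push(doc));
    lastDoc = snap.docs[snap.docs.length - 1];
  }
  return documentsArray;
}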

In the end the issue was that the machine processing the 30k records from Firestore was not powerful enough to get the data in time. Solved by using an n1-standard-4 GCE instance.

Related

How to create a Flutter Stream using MongoDB (watch collection?) with Firebase Cloud Function

I've been trying out MongoDB as the database for my Flutter project lately, since I want to migrate away from a pure Firebase database (some limitations in Firebase are an issue for my project, like the "in-array" limit of 10 for queries).
I already made some CRUD operation methods in Firebase Cloud Functions using MongoDB. I'm now able to save data and display it as a Future in a Flutter app (a simple ListView of Users in a FutureBuilder).
My question is: how would it be possible to create a StreamBuilder using MongoDB and Firebase Cloud Functions? I saw some material about watching a collection and change streams, but nothing clear enough for me (usually I read a lot of examples or tutorials to understand).
Maybe some of you have some clues, or a tutorial that I can read/watch to learn a little bit more about this subject?
For now, I have this as an example (a Node.js Cloud Function deployed to Firebase), which obviously produces a Future in my Flutter app (not realtime):
const functions = require("firebase-functions");
const {MongoClient} = require("mongodb");

exports.getUsers = functions.https.onCall(async (data, context) => {
  const uri = "mongodb+srv://....";
  const client = new MongoClient(uri);
  await client.connect();
  // Fetch every user document and return the array to the caller.
  const results = await client.db("myDB").collection("user").find({}).toArray();
  await client.close();
  return results;
});
What would you advise to get a Stream instead of a Future, perhaps using MongoDB's watch collection / change streams? An example would be very welcome!
Thank you very much!
Cloud Functions are meant for short-lived operations, not for long-term listeners. It is not possible to create long-lived connections from Cloud Functions, neither to other services (such as you're trying to do to MongoDB here) nor from Cloud Functions back to the calling client.
Also see:
If I implement onSnapshot real-time listener to Firestore in Cloud Function will it cost more?
Can a Firestore query listener "listen" to a cloud function?
the documentation on Eventarc, which is the platform that allows you to build custom triggers. It'll be a lot more involved though.
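For illustration only, here is a minimal sketch of what a MongoDB change-stream listener looks like; it assumes a long-lived Node.js process (for example on a VM or Cloud Run), which is exactly what a Cloud Function cannot give you. The URI and collection names are placeholders:

const {MongoClient} = require("mongodb");

async function watchUsers() {
  const client = new MongoClient("mongodb+srv://<your-cluster-uri>");
  await client.connect();

  // Open a change stream on the collection. This keeps a connection open
  // indefinitely and emits an event for every insert/update/delete.
  const changeStream = client.db("myDB").collection("user").watch();
  changeStream.on("change", change => {
    console.log("change event:", change.operationType);
    // Forward the change to connected clients here (e.g. via WebSockets).
  });
}

watchUsers().catch(console.error);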

Why does Firebase Firestore count read operations while I am only adding new documents?

I have created an API endpoint with Firebase Functions using Node.js. This API endpoint collects JSON data from the client browser, and I save that JSON data to the Firebase Firestore database using Firebase Functions.
While this works fine, when I look at the Firestore usage tab it shows a really high number of read operations, even though I have not written any read function yet.
My API is in production and the current usage data is: Reads 9.7K, Writes 1K, Deletes 0.
I have already checked the Firebase Firestore documentation and pricing pages but couldn't find anything on this issue.
I am using the Firestore add function to create a document with an auto-generated document id. ValidateSubscriberData() is a simple function that validates the client's req.body input, which is JSON data.
app.post('/subscribe', (req, res) => {
  let subscriber = {};
  ValidateSubscriberData(req.body)
    .then(data => {
      subscriber = data;
      //console.log(data);
      subscriber.time = Date.now();
      return subscriber;
    })
    .then(subscriber => {
      //console.log(subscriber);
      // noinspection JSCheckFunctionSignatures
      return db.collection(subscriber.host).add(subscriber);
    })
    .then(document => {
      console.log(document.id);
      res.json({id: document.id, iid: subscriber.iid});
      return 0;
    })
    .catch(error => {
      console.log({SelfError: error});
      res.json(error);
    });
});
I don't know whether this is an issue with Firestore or whether I am doing something that triggers read operations internally, but I want to find a way to optimize my code.
English is not my first language and I am trying my best to explain my issue.
I think Firestore is working perfectly fine, and so is my code. I assume Firebase is counting the reads I made through the Firebase console.
To verify this I clicked the Data tab on the Firestore page and scrolled down until all document names/ids were visible. After that, another 1K reads were added to my earlier stats. So it is confirmed that Firestore counts every read, even the ones made from the Firebase console. That is obvious in hindsight, but I simply had not thought about it before.
I don't think this question has much relevance otherwise, but maybe people like me will find it helpful before posting a silly question on this helpful platform.

Why does a simple SQL query cause a significant slowdown in my Lambda function?

I built a basic node.js API hosted on AWS Lambda and served over AWS API Gateway. This is the code:
'use strict';

// Require and initialize outside of your main handler
const mysql = require('serverless-mysql')({
  config: {
    host     : process.env.ENDPOINT,
    database : process.env.DATABASE,
    user     : process.env.USERNAME,
    password : process.env.PASSWORD
  }
});

// Import the Dialogflow module from the Actions on Google client library.
const {dialogflow} = require('actions-on-google');

// Instantiate the Dialogflow client.
const app = dialogflow({debug: true});

// Handle the Dialogflow intent named 'trip name'.
// The intent collects a parameter named 'tripName'.
app.intent('trip name', async (conv, {tripName}) => {
  // Run your query
  let results = await mysql.query('SELECT * FROM tablename where field = ? limit 1', tripName);

  // Respond with the user's lucky number and end the conversation.
  conv.close('Your lucky number is ' + results[0].id);

  // Run clean up function
  await mysql.end();
});

// Set the DialogflowApp object to handle the HTTPS POST request.
exports.fulfillment = app;
It receives a parameter (a trip name), looks it up in MySQL and returns the result.
My issue is that the API takes more than 5 seconds to respond, which is slow.
I'm not sure why it's slow. The MySQL database is a powerful Amazon Aurora instance, and Node.js is supposed to be fast.
I tested the function from the same AWS region as the MySQL database (Mumbai) and it still times out, so I don't think it has to do with distance between regions.
The slowness comes from carrying out any SQL query (even a dead-simple SELECT). It does bring back the correct result, but slowly.
When I remove the SQL part it becomes blazing fast. I have already increased the Lambda's memory to the maximum and redeployed Aurora on a far more powerful instance.
Lambda functions will run faster if you configure more memory. The less the memory configured, the worse the performance.
This means if you have configured your function to use 128MB, it's going to be run in a very low profile hardware.
On the other hand, if you configure it to use 3GB, it will run in a very decent machine.
At 1792MB, your function will run on hardware with a dedicated core, which will speed up your code significantly considering you are making I/O calls (network requests, for example). You can find this information in the AWS Lambda documentation.
There's no magic formula though. You have to run a few tests and see what memory configuration best suits your application. I would start with 3GB and eventually decrease it by chunks of 128MB until you find the right configuration.
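As a quick illustration (not part of the original answer), the memory setting can also be changed programmatically with the AWS SDK for JavaScript while you experiment; the function name and region below are placeholders:

// Sketch: bump a Lambda function's memory allocation with the AWS SDK (v2).
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({region: 'ap-south-1'}); // Mumbai, as in the question

lambda.updateFunctionConfiguration({
  FunctionName: 'fulfillment',   // placeholder; use your actual function name
  MemorySize: 3008               // start high, then step down while testing
}, (err, data) => {
  if (err) console.error(err);
  else console.log('New memory size:', data.MemorySize);
});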

Google cloud datastore slow (>800ms) with simple query from compute engine

When I try to query Google Cloud Datastore from a (micro) Compute Engine instance, it usually takes >800ms to get a reply. The best I got was 450ms, the worst was >3 seconds.
I was under the impression that latency should be much, much lower (like 20-80ms), so I'm guessing I'm doing something wrong.
This is the (node.js) code I'm using to query (from a simple datastore with just a single entity):
const Datastore = require('@google-cloud/datastore');
const projectId = '<my-project-id>';

const datastoreClient = Datastore({
  projectId: projectId
});

var query = datastoreClient.createQuery('Test').limit(1);

console.time('query');
query.run(function (err, test) {
  if (err) {
    console.log(err);
    return;
  }
  console.timeEnd('query');
});
Not sure if it's relevant, but my App Engine project is in the us-central region, as is the Compute Engine instance I'm running the query from.
UPDATE
After some more testing I found out that the default authentication (token?) that you get when using the Node.js library provided by Google expires after about 4 minutes.
So in other words: if you use the same process, but you wait 4 minutes or more between requests, query times are back to >800ms.
I also tried authenticating using a key file, and that seemed to do better: subsequent requests are still faster, but the initial request takes only half the time (>400ms).
The latency that you see for your initial requests to the Datastore is most likely due to caches being warmed up. The Datastore uses a distributed architecture to manage scaling, which allows your queries to scale with the size of your result set. The more often you perform the same query, the better prepared the Datastore is to serve it, and the more consistent your response times become.
If you want similar result speeds on low Datastore access rates, it is recommended to configure your own caching layer. Google App Engine provides Memcache which is optimized for use with the Datastore. Since you are making requests from Compute Engine, you can use other third-party solutions such as Redis or Memcached.
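A minimal sketch of such a caching layer, assuming a Redis instance reachable from the Compute Engine VM, the legacy callback-style node_redis API, and the same 'Test' query as above (the cache key and the 60-second TTL are illustrative):

const Datastore = require('@google-cloud/datastore');
const redis = require('redis');

const datastoreClient = Datastore({projectId: '<my-project-id>'});
const cache = redis.createClient(); // assumes Redis on localhost:6379

function getTestEntity(callback) {
  // Serve from Redis when possible; otherwise query Datastore and cache the result.
  cache.get('test-entity', (cacheErr, cached) => {
    if (!cacheErr && cached) {
      return callback(null, JSON.parse(cached));
    }
    const query = datastoreClient.createQuery('Test').limit(1);
    query.run((err, entities) => {
      if (err) return callback(err);
      cache.setex('test-entity', 60, JSON.stringify(entities)); // 60s TTL
      callback(null, entities);
    });
  });
}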

How do I use the foreach method in MongoDB to do scraping/API calls without getting blacklisted by sites?

I have about 20 documents currently in my collection (and I'm planning to add many more, probably in the hundreds). I'm using the MongoDB Node.js client's collection.forEach() method to iterate through each one and, based on the document's data, hit 3 different endpoints: two APIs (Walmart and Amazon) and one website scrape (name not relevant). Each document contains the data needed to execute the requests, and I then update the documents with the returned data.
The problem I'm encountering is that the Walmart API and the website scrape will not return data toward the end of the iteration, or at least my database is not getting updated. My assumption is that the forEach method is firing off a bunch of simultaneous requests, and either I'm bumping up against some arbitrary limit of simultaneous requests allowed by the endpoint, or the endpoints simply can't handle this many requests and ignore anything above and beyond their "request capacity." I've run some of the documents that were not updating through the same code, but in a different collection that contained just a single document, and they did update, so I don't think it's bad data inside the documents.
I'm running this on Heroku (and locally for testing) using Node.js. Results are similar both on Heroku instance and locally.
If my assumption is correct, I need a better way to structure this so that there is some separation between requests, or so that it only processes x records in a single pass.
It sounds like you need to throttle your outgoing web requests. There's a fantastic node module for doing this called limiter. The code looks like this:
var RateLimiter = require('limiter').RateLimiter;
var limiter = new RateLimiter(1, 1000);

var throttledRequest = function() {
  limiter.removeTokens(1, function() {
    console.log('Only prints once per second');
  });
};

throttledRequest();
throttledRequest();
throttledRequest();
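Applied to the scenario in the question, a hedged sketch might look like the following; the collection name, the scrapedData field, and the fetchProductData helper are hypothetical placeholders for the real API/scraping calls, and db is assumed to be an already-connected MongoDB database handle:

var RateLimiter = require('limiter').RateLimiter;
// Allow at most one outgoing request per second.
var limiter = new RateLimiter(1, 1000);

db.collection('products').find().forEach(function (doc) {
  // Each queued callback fires only when a token is available,
  // so requests go out one per second instead of all at once.
  limiter.removeTokens(1, function () {
    fetchProductData(doc) // hypothetical API/scrape call returning a promise
      .then(function (result) {
        return db.collection('products').updateOne(
          {_id: doc._id},
          {$set: {scrapedData: result}}
        );
      })
      .catch(function (err) {
        console.error('Failed for', doc._id, err);
      });
  });
});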
